I have following timeseries data
Product DATE price A 01-Jan-18 10 A 02-Jan-18 10 A 03-Jan-18 10 A 04-Jan-18 10 A 05-Jan-18 10 A 06-Jan-18 10 A 07-Jan-18 3 A 08-Jan-18 11 A 10-Jan-18 11 B 15-Jan-18 15 B 16-Jan-17 15 B 17-Jan-17 15 B 20-Jan-17 15 B 21-Jan-17 19 B 22-Jan-17 19 B 23-Jan-17 19 B 24-Jan-17 18
Code
data Table1;
input product $ date date9. price;
format date date9.;
datalines;
A 01-Jan-18 10
A 02-Jan-18 10
A 03-Jan-18 10
A 04-Jan-18 10
A 05-Jan-18 10
A 06-Jan-18 10
A 07-Jan-18 3
A 08-Jan-18 11
A 10-Jan-18 11
B 15-Jan-18 15
B 16-Jan-17 15
B 17-Jan-17 15
B 20-Jan-17 15
B 21-Jan-17 19
B 22-Jan-17 19
B 23-Jan-17 19
B 24-Jan-17 18
;
run;
I wanted an ouput like this
Product Start_date End_date Price A 01-Jan-18 06-Jan-18 10 A 07-Jan-18 07-Jan-18 3 A 08-Jan-18 08-Jan-18 11 A 10-Jan-18 10-Jan-18 11 B 15-Jan-18 17-Jan-17 15 B 20-Jan-17 20-Jan-17 15 B 21-Jan-17 23-Jan-17 19 B 24-Jan-17 24-Jan-17 18
I am running following code
data temp;
set table1;
by product price notsorted;
retain grp;
k=dif(date);
if first.product then grp=1;
else if k ne 1 or first.price then grp+1;
drop k;
run;
proc sql;
create table want as
select distinct product,min(date) as Start_date format=date9.,max(date) as End_date format=date9., price
from temp
group by product, grp
order by product, start_date;
quit;
My table1 had more than 80m data, my code run fine and give the output I am looking for but it is taking enormous amount of time.
more than 7 & 8 hours.
Is there any way I can optimise it.
Try this approach. Btw, you have an error in your sample data at the first obs in Product=B
data Table1;
input product $ date date9. price;
format date date9.;
datalines;
A 01-Jan-18 10
A 02-Jan-18 10
A 03-Jan-18 10
A 04-Jan-18 10
A 05-Jan-18 10
A 06-Jan-18 10
A 07-Jan-18 3
A 08-Jan-18 11
A 10-Jan-18 11
B 15-Jan-17 15
B 16-Jan-17 15
B 17-Jan-17 15
B 20-Jan-17 15
B 21-Jan-17 19
B 22-Jan-17 19
B 23-Jan-17 19
B 24-Jan-17 18
;
data temp;
set Table1;
if product ne lag(product) or price ne lag(price) or dif(date) gt 1 then n+1;
run;
proc summary data=temp nway;
class n Product Price;
var date;
output out=want(keep=Product Price Start_date End_date) min=Start_date max=End_date;
run;
What part takes time, the PROC SORT or the code you posted?
Try this approach. Btw, you have an error in your sample data at the first obs in Product=B
data Table1;
input product $ date date9. price;
format date date9.;
datalines;
A 01-Jan-18 10
A 02-Jan-18 10
A 03-Jan-18 10
A 04-Jan-18 10
A 05-Jan-18 10
A 06-Jan-18 10
A 07-Jan-18 3
A 08-Jan-18 11
A 10-Jan-18 11
B 15-Jan-17 15
B 16-Jan-17 15
B 17-Jan-17 15
B 20-Jan-17 15
B 21-Jan-17 19
B 22-Jan-17 19
B 23-Jan-17 19
B 24-Jan-17 18
;
data temp;
set Table1;
if product ne lag(product) or price ne lag(price) or dif(date) gt 1 then n+1;
run;
proc summary data=temp nway;
class n Product Price;
var date;
output out=want(keep=Product Price Start_date End_date) min=Start_date max=End_date;
run;
I do not see a proc sort here:
@Srigyan wrote:
I have following timeseries data
Product DATE price A 01-Jan-18 10 A 02-Jan-18 10 A 03-Jan-18 10 A 04-Jan-18 10 A 05-Jan-18 10 A 06-Jan-18 10 A 07-Jan-18 3 A 08-Jan-18 11 A 10-Jan-18 11 B 15-Jan-18 15 B 16-Jan-17 15 B 17-Jan-17 15 B 20-Jan-17 15 B 21-Jan-17 19 B 22-Jan-17 19 B 23-Jan-17 19 B 24-Jan-17 18
Code
data Table1; input product $ date date9. price; format date date9.; datalines; A 01-Jan-18 10 A 02-Jan-18 10 A 03-Jan-18 10 A 04-Jan-18 10 A 05-Jan-18 10 A 06-Jan-18 10 A 07-Jan-18 3 A 08-Jan-18 11 A 10-Jan-18 11 B 15-Jan-18 15 B 16-Jan-17 15 B 17-Jan-17 15 B 20-Jan-17 15 B 21-Jan-17 19 B 22-Jan-17 19 B 23-Jan-17 19 B 24-Jan-17 18 ; run;
I wanted an ouput like this
Product Start_date End_date Price A 01-Jan-18 06-Jan-18 10 A 07-Jan-18 07-Jan-18 3 A 08-Jan-18 08-Jan-18 11 A 10-Jan-18 10-Jan-18 11 B 15-Jan-18 17-Jan-17 15 B 20-Jan-17 20-Jan-17 15 B 21-Jan-17 23-Jan-17 19 B 24-Jan-17 24-Jan-17 18I am running following code
data temp; set table1; by product price notsorted; retain grp; k=dif(date); if first.product then grp=1; else if k ne 1 or first.price then grp+1; drop k; run; proc sql; create table want as select distinct product,min(date) as Start_date format=date9.,max(date) as End_date format=date9., price from temp group by product, grp order by product, start_date; quit;
My table1 had more than 80m data, my code run fine and give the output I am looking for but it is taking enormous amount of time.
more than 7 & 8 hours.
Is there any way I can optimise it.
It has already been mentioned: if you talk about something being slow, posting full log is required to see where the problem is.
And, as has been said, too: you don't need proc sql to solve the problem a data step is all you need, is all you need ...
data want;
set table1(rename=(Price = _Price));
by product;
length Start_date End_date Price 8;
format Start_date End_date date9.;
retain Start_date Price;
if first.product then do;
Start_date = DAte;
price = _price;
end;
_last = lag(Date);
if price ^= _price then do;
End_Date = _last;
output;
Start_Date = Date;
price = _price;
end;
if not missing(_last) and _last+1 < Date then do;
End_date = _last;
output;
Start_date = date;
end;
if last.product then do;
End_date = date;
output;
end;
drop _:;
run;
happy weekend
@andreas_lds you're on the right path.
You would probably find it easier to use this BY statement:
by product price notsorted;
Also note, guard against outputting the same observation twice. For example, what would happen if the PRICE changes on the last observation for a PRODUCT?
@Astounding wrote:
@andreas_lds you're on the right path.
You would probably find it easier to use this BY statement:
by product price notsorted;
Also note, guard against outputting the same observation twice. For example, what would happen if the PRICE changes on the last observation for a PRODUCT?
thanks for your input, i thought about using the by-statement you mention, but what requires "notsorted" that one price appears only once in the data? Could check this myself, of course, but i am already late for leaving office 😉
Too late to the party. Hmm, Just 9:22 EDT the day seems gone. Anyways, here is my attempt
data Table1;
input product $ date date9. price;
format date date9.;
datalines;
A 01-Jan-18 10
A 02-Jan-18 10
A 03-Jan-18 10
A 04-Jan-18 10
A 05-Jan-18 10
A 06-Jan-18 10
A 07-Jan-18 3
A 08-Jan-18 11
A 10-Jan-18 11
B 15-Jan-17 15
B 16-Jan-17 15
B 17-Jan-17 15
B 20-Jan-17 15
B 21-Jan-17 19
B 22-Jan-17 19
B 23-Jan-17 19
B 24-Jan-17 18
;
data want ;
length product $8 Start_date End_date price 8;/*Just to maintain order of variables*/
if _n_=1 then do;
dcl hash H (dataset:'Table1',multidata:'y') ;
h.definekey ("product","price","date") ;
h.definedone () ;
end;
set Table1;
by product price notsorted date ;
if first.date;
do date=date by 1 while(h.check()=0);
Start_date=min(Start_date ,date);
End_date=max(End_date,date);
rc=h.remove();
end;
if Start_date>. and End_date>. then output;
keep product Start_date End_date price;
format start_date end_date date9.;
run;
Thanks for putting your effort. I could have chose only one code as solution.
@Srigyan You're very welcome. It's a buffet to choose from. I am here to learn as much as you do. I am hoping to get as good as some whom I dearly follow and I hope that day isn't too far away. Have a good one. Cheers!
Temp array is convenient too
data Table1;
input product $ date date9. price;
format date date9.;
datalines;
A 01-Jan-18 10
A 02-Jan-18 10
A 03-Jan-18 10
A 04-Jan-18 10
A 05-Jan-18 10
A 06-Jan-18 10
A 07-Jan-18 3
A 08-Jan-18 11
A 10-Jan-18 11
B 15-Jan-17 15
B 16-Jan-17 15
B 17-Jan-17 15
B 20-Jan-17 15
B 21-Jan-17 19
B 22-Jan-17 19
B 23-Jan-17 19
B 24-Jan-17 18
;
data want;
length product $8 Start_date End_date price 8;/*Just to maintain order of variables*/
do _n_=1 by 1 until(last.price);
set table1;
by product price notsorted date ;
array t(99999) _temporary_;
t(_n_)=date;
end;
do _n_=1 to _n_;
set table1;
if date not in t then continue;
do date= date by 1 while(date in t);
Start_date=min(Start_date ,date);
End_date=max(End_date,date);
call missing(t(whichn(date,of t(*))));
end;
output;
call missing(Start_date,end_date);
end;
call missing(of t(*));
drop date;
format start_date end_date date9.;
run;
So you want to collapse periods of time where the PRICE doesn't change?
Make sure the data is sorted by product and date. The use BY product price NOTSORTED and SAS will automatically detect the groups.
You could find the start and stop date yourself.
data want ;
do until (last.price);
set table1;
by product price notsorted;
if first.price then start_date=date;
end;
end_date=date;
drop date;
format start_date end_date date9.;
run;
Or let PROC SUMMARY do it for you.
proc summary data=table1 nway;
by producet price notsorted;
var date ;
output out=want(drop=_type_ _freq_) min=start_date max=end_date;
run;
Try both and see which is faster. I would expect PROC SUMMARY might take less wall clock time since it should be able to run the groups in parallel, but perhaps the NOTSORTED option will prevent that.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.