BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Srigyan
Quartz | Level 8

I have following timeseries data

 

 

Product	DATE	price
A	01-Jan-18	10
A	02-Jan-18	10
A	03-Jan-18	10
A	04-Jan-18	10
A	05-Jan-18	10
A	06-Jan-18	10
A	07-Jan-18	3
A	08-Jan-18	11
A	10-Jan-18	11
B	15-Jan-18	15
B	16-Jan-17	15
B	17-Jan-17	15
B	20-Jan-17	15
B	21-Jan-17	19
B	22-Jan-17	19
B	23-Jan-17	19
B	24-Jan-17	18

 

 

 Code

 

data Table1;

input product $ date date9.  price;
format date date9.;
datalines;
A 01-Jan-18 10
A 02-Jan-18 10
A 03-Jan-18 10
A 04-Jan-18 10
A 05-Jan-18 10
A 06-Jan-18 10
A 07-Jan-18 3
A 08-Jan-18 11
A 10-Jan-18 11
B 15-Jan-18 15
B 16-Jan-17 15
B 17-Jan-17 15
B 20-Jan-17 15
B 21-Jan-17 19
B 22-Jan-17 19
B 23-Jan-17 19
B 24-Jan-17 18
;

run;

I wanted an ouput like this

 

 

 

Product	Start_date	End_date	Price
A	01-Jan-18	06-Jan-18	10
A	07-Jan-18	07-Jan-18	3
A	08-Jan-18	08-Jan-18	11
A	10-Jan-18	10-Jan-18	11
B	15-Jan-18	17-Jan-17	15
B	20-Jan-17	20-Jan-17	15
B	21-Jan-17	23-Jan-17	19
B	24-Jan-17	24-Jan-17	18

I am running following code

 

 


data temp;
set table1;
by product price notsorted;
retain grp;
k=dif(date);
if first.product then grp=1;
else if k ne 1 or first.price then grp+1;
drop k;
run;

proc sql;
create table want as
select distinct product,min(date) as  Start_date	format=date9.,max(date) as End_date format=date9., price
from temp
group by product, grp
order by product, start_date;
quit;

My table1 had more than 80m data, my code run fine and give the output I am looking for but it is taking enormous amount of time.

more than 7 & 8 hours.

 

Is there any way I can optimise it.

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20

Try this approach. Btw, you have an error in your sample data at the first obs in Product=B

 


data Table1;
input product $ date date9.  price;
format date date9.;
datalines;
A 01-Jan-18 10
A 02-Jan-18 10
A 03-Jan-18 10
A 04-Jan-18 10
A 05-Jan-18 10
A 06-Jan-18 10
A 07-Jan-18 3
A 08-Jan-18 11
A 10-Jan-18 11
B 15-Jan-17 15
B 16-Jan-17 15
B 17-Jan-17 15
B 20-Jan-17 15
B 21-Jan-17 19
B 22-Jan-17 19
B 23-Jan-17 19
B 24-Jan-17 18
;

data temp;
    set Table1;
    if product ne lag(product) or price ne lag(price) or dif(date) gt 1 then n+1;
run;

proc summary data=temp nway;
    class n Product Price;
    var date;
    output out=want(keep=Product Price Start_date End_date) min=Start_date max=End_date;
run;

View solution in original post

12 REPLIES 12
PeterClemmensen
Tourmaline | Level 20

What part takes time, the PROC SORT or the code you posted?

PeterClemmensen
Tourmaline | Level 20

Try this approach. Btw, you have an error in your sample data at the first obs in Product=B

 


data Table1;
input product $ date date9.  price;
format date date9.;
datalines;
A 01-Jan-18 10
A 02-Jan-18 10
A 03-Jan-18 10
A 04-Jan-18 10
A 05-Jan-18 10
A 06-Jan-18 10
A 07-Jan-18 3
A 08-Jan-18 11
A 10-Jan-18 11
B 15-Jan-17 15
B 16-Jan-17 15
B 17-Jan-17 15
B 20-Jan-17 15
B 21-Jan-17 19
B 22-Jan-17 19
B 23-Jan-17 19
B 24-Jan-17 18
;

data temp;
    set Table1;
    if product ne lag(product) or price ne lag(price) or dif(date) gt 1 then n+1;
run;

proc summary data=temp nway;
    class n Product Price;
    var date;
    output out=want(keep=Product Price Start_date End_date) min=Start_date max=End_date;
run;
Kurt_Bremser
Super User

I do not see a proc sort here:


@Srigyan wrote:

I have following timeseries data

 

 

Product	DATE	price
A	01-Jan-18	10
A	02-Jan-18	10
A	03-Jan-18	10
A	04-Jan-18	10
A	05-Jan-18	10
A	06-Jan-18	10
A	07-Jan-18	3
A	08-Jan-18	11
A	10-Jan-18	11
B	15-Jan-18	15
B	16-Jan-17	15
B	17-Jan-17	15
B	20-Jan-17	15
B	21-Jan-17	19
B	22-Jan-17	19
B	23-Jan-17	19
B	24-Jan-17	18

 

 

 Code

 

data Table1;

input product $ date date9.  price;
format date date9.;
datalines;
A 01-Jan-18 10
A 02-Jan-18 10
A 03-Jan-18 10
A 04-Jan-18 10
A 05-Jan-18 10
A 06-Jan-18 10
A 07-Jan-18 3
A 08-Jan-18 11
A 10-Jan-18 11
B 15-Jan-18 15
B 16-Jan-17 15
B 17-Jan-17 15
B 20-Jan-17 15
B 21-Jan-17 19
B 22-Jan-17 19
B 23-Jan-17 19
B 24-Jan-17 18
;

run;

I wanted an ouput like this

 

 

 

Product	Start_date	End_date	Price
A	01-Jan-18	06-Jan-18	10
A	07-Jan-18	07-Jan-18	3
A	08-Jan-18	08-Jan-18	11
A	10-Jan-18	10-Jan-18	11
B	15-Jan-18	17-Jan-17	15
B	20-Jan-17	20-Jan-17	15
B	21-Jan-17	23-Jan-17	19
B	24-Jan-17	24-Jan-17	18

I am running following code

 

 


data temp;
set table1;
by product price notsorted;
retain grp;
k=dif(date);
if first.product then grp=1;
else if k ne 1 or first.price then grp+1;
drop k;
run;

proc sql;
create table want as
select distinct product,min(date) as  Start_date	format=date9.,max(date) as End_date format=date9., price
from temp
group by product, grp
order by product, start_date;
quit;

My table1 had more than 80m data, my code run fine and give the output I am looking for but it is taking enormous amount of time.

more than 7 & 8 hours.

 

Is there any way I can optimise it.

 

 


 

Astounding
PROC Star
Get rid of SQL. It won't take advantage of the fact that the data is sorted. Faster:

proc summary data=temp;
by product grp notsorted;
var date;
output out=want (drop=_freq_ _type_ grp)
min=start_date max=end_date;
run;

Faster still: it would require a few revisions to the DATA step to make it produce WANT. There are a few tricky aspects to that, however.
andreas_lds
Jade | Level 19

It has already been mentioned: if you talk about something being slow, posting full log is required to see where the problem is.

 

And, as has been said, too: you don't need proc sql to solve the problem a data step is all you need, is all you need ...

 

data want;
   set table1(rename=(Price = _Price));
   by product;

   length Start_date End_date Price 8;
   format Start_date End_date date9.;
   retain Start_date Price;

   if first.product then do;
      Start_date = DAte;
      price = _price;
   end;

   _last = lag(Date);

   if price ^= _price then do;
      End_Date = _last;
      output;
      Start_Date = Date;
      price = _price;
   end;

   if not missing(_last) and _last+1 < Date then do;
      End_date = _last;
      output;
      Start_date = date;
   end;

   if last.product then do;
      End_date = date;
      output;
   end;

   drop _:;
run;

happy weekend Smiley Wink

 

Astounding
PROC Star

@andreas_lds you're on the right path.

 

You would probably find it easier to use this BY statement:

 

by product price notsorted;

 

Also note, guard against outputting the same observation twice.  For example, what would happen if the PRICE changes on the last observation for a PRODUCT?

andreas_lds
Jade | Level 19

@Astounding wrote:

@andreas_lds you're on the right path.

 

You would probably find it easier to use this BY statement:

 

by product price notsorted;

 

Also note, guard against outputting the same observation twice.  For example, what would happen if the PRICE changes on the last observation for a PRODUCT?


thanks for your input, i thought about using the by-statement you mention, but what requires "notsorted" that one price appears only once in the data? Could check this myself, of course, but i am already late for leaving office 😉

novinosrin
Tourmaline | Level 20

Too late to the party. Hmm, Just 9:22 EDT the day seems gone. Anyways, here is my attempt

 

data Table1;
input product $ date date9.  price;
format date date9.;
datalines;
A 01-Jan-18 10
A 02-Jan-18 10
A 03-Jan-18 10
A 04-Jan-18 10
A 05-Jan-18 10
A 06-Jan-18 10
A 07-Jan-18 3
A 08-Jan-18 11
A 10-Jan-18 11
B 15-Jan-17 15
B 16-Jan-17 15
B 17-Jan-17 15
B 20-Jan-17 15
B 21-Jan-17 19
B 22-Jan-17 19
B 23-Jan-17 19
B 24-Jan-17 18
;


data want ;
 length product $8 Start_date End_date price 8;/*Just to maintain order of variables*/
 if _n_=1 then do;
    dcl hash H (dataset:'Table1',multidata:'y') ;
    h.definekey  ("product","price","date") ;
    h.definedone () ;
 end;
 set Table1;
 by product price notsorted date ;
 if first.date;
 do date=date by 1 while(h.check()=0);
  Start_date=min(Start_date	,date);
  End_date=max(End_date,date);
  rc=h.remove();
 end;
 if Start_date>. and End_date>. then output;
 keep product Start_date End_date price;
 format start_date end_date date9.;
run;
Srigyan
Quartz | Level 8

Thanks for putting your effort. I could have chose only one code as solution. 

novinosrin
Tourmaline | Level 20

@Srigyan  You're very welcome. It's a buffet to choose from. I am here to learn as much as you do. I am hoping to get as good as some whom I dearly follow and I hope that day isn't too far away. Have a good one. Cheers!

novinosrin
Tourmaline | Level 20

Temp array is convenient too

 

data Table1;
input product $ date date9.  price;
format date date9.;
datalines;
A 01-Jan-18 10
A 02-Jan-18 10
A 03-Jan-18 10
A 04-Jan-18 10
A 05-Jan-18 10
A 06-Jan-18 10
A 07-Jan-18 3
A 08-Jan-18 11
A 10-Jan-18 11
B 15-Jan-17 15
B 16-Jan-17 15
B 17-Jan-17 15
B 20-Jan-17 15
B 21-Jan-17 19
B 22-Jan-17 19
B 23-Jan-17 19
B 24-Jan-17 18
;

data want;
 length product $8 Start_date End_date price 8;/*Just to maintain order of variables*/
 do _n_=1 by 1 until(last.price);
  set table1;
  by product price notsorted date ;
  array t(99999) _temporary_;
  t(_n_)=date; 
 end;
 do _n_=1 to _n_;
  set table1;
  if date not in t then continue;
  do date= date by 1 while(date in t);
   Start_date=min(Start_date	,date);
   End_date=max(End_date,date);
   call missing(t(whichn(date,of t(*))));
  end;
  output;
  call missing(Start_date,end_date);
 end;
 call missing(of t(*));
 drop date;
 format start_date end_date date9.;
run;
Tom
Super User Tom
Super User

So you want to collapse periods of time where the PRICE doesn't change?

Make sure the data is sorted by product and date.  The use BY product price NOTSORTED and SAS will automatically detect the groups.

 

You could find the start and stop date yourself.

data want ;
  do until (last.price);
    set table1;
    by product price notsorted;
    if first.price then start_date=date;
  end;
  end_date=date;
  drop date;
  format start_date end_date date9.;
run;

Or let PROC SUMMARY do it for you.

proc summary data=table1 nway;
  by producet price notsorted;
  var date ;
  output out=want(drop=_type_ _freq_) min=start_date max=end_date;
run;

Try both and see which is faster.  I would expect PROC SUMMARY might take less wall clock time since it should be able to run the groups in parallel, but perhaps the NOTSORTED option will prevent that.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 12 replies
  • 2539 views
  • 6 likes
  • 7 in conversation