Solved: Re: slow proc sort and code

Srigyan · Posted 08-16-2019 06:05 AM

I have following timeseries data

Product	DATE	price
A	01-Jan-18	10
A	02-Jan-18	10
A	03-Jan-18	10
A	04-Jan-18	10
A	05-Jan-18	10
A	06-Jan-18	10
A	07-Jan-18	3
A	08-Jan-18	11
A	10-Jan-18	11
B	15-Jan-18	15
B	16-Jan-17	15
B	17-Jan-17	15
B	20-Jan-17	15
B	21-Jan-17	19
B	22-Jan-17	19
B	23-Jan-17	19
B	24-Jan-17	18

Code

data Table1;

input product $ date date9.  price;
format date date9.;
datalines;
A 01-Jan-18 10
A 02-Jan-18 10
A 03-Jan-18 10
A 04-Jan-18 10
A 05-Jan-18 10
A 06-Jan-18 10
A 07-Jan-18 3
A 08-Jan-18 11
A 10-Jan-18 11
B 15-Jan-18 15
B 16-Jan-17 15
B 17-Jan-17 15
B 20-Jan-17 15
B 21-Jan-17 19
B 22-Jan-17 19
B 23-Jan-17 19
B 24-Jan-17 18
;

run;

I wanted an ouput like this

Product	Start_date	End_date	Price
A	01-Jan-18	06-Jan-18	10
A	07-Jan-18	07-Jan-18	3
A	08-Jan-18	08-Jan-18	11
A	10-Jan-18	10-Jan-18	11
B	15-Jan-18	17-Jan-17	15
B	20-Jan-17	20-Jan-17	15
B	21-Jan-17	23-Jan-17	19
B	24-Jan-17	24-Jan-17	18

I am running following code


data temp;
set table1;
by product price notsorted;
retain grp;
k=dif(date);
if first.product then grp=1;
else if k ne 1 or first.price then grp+1;
drop k;
run;

proc sql;
create table want as
select distinct product,min(date) as  Start_date	format=date9.,max(date) as End_date format=date9., price
from temp
group by product, grp
order by product, start_date;
quit;

My table1 had more than 80m data, my code run fine and give the output I am looking for but it is taking enormous amount of time.

more than 7 & 8 hours.

Is there any way I can optimise it.

PeterClemmensen · Posted 08-16-2019 07:34 AM

Try this approach. Btw, you have an error in your sample data at the first obs in Product=B


data Table1;
input product $ date date9.  price;
format date date9.;
datalines;
A 01-Jan-18 10
A 02-Jan-18 10
A 03-Jan-18 10
A 04-Jan-18 10
A 05-Jan-18 10
A 06-Jan-18 10
A 07-Jan-18 3
A 08-Jan-18 11
A 10-Jan-18 11
B 15-Jan-17 15
B 16-Jan-17 15
B 17-Jan-17 15
B 20-Jan-17 15
B 21-Jan-17 19
B 22-Jan-17 19
B 23-Jan-17 19
B 24-Jan-17 18
;

data temp;
    set Table1;
    if product ne lag(product) or price ne lag(price) or dif(date) gt 1 then n+1;
run;

proc summary data=temp nway;
    class n Product Price;
    var date;
    output out=want(keep=Product Price Start_date End_date) min=Start_date max=End_date;
run;

The DATA to DATA Step Macro
Blog: SASnrd

View solution in original post

PeterClemmensen · Posted 08-16-2019 06:13 AM

What part takes time, the PROC SORT or the code you posted?

The DATA to DATA Step Macro
Blog: SASnrd

PeterClemmensen · Posted 08-16-2019 07:34 AM

Try this approach. Btw, you have an error in your sample data at the first obs in Product=B


data Table1;
input product $ date date9.  price;
format date date9.;
datalines;
A 01-Jan-18 10
A 02-Jan-18 10
A 03-Jan-18 10
A 04-Jan-18 10
A 05-Jan-18 10
A 06-Jan-18 10
A 07-Jan-18 3
A 08-Jan-18 11
A 10-Jan-18 11
B 15-Jan-17 15
B 16-Jan-17 15
B 17-Jan-17 15
B 20-Jan-17 15
B 21-Jan-17 19
B 22-Jan-17 19
B 23-Jan-17 19
B 24-Jan-17 18
;

data temp;
    set Table1;
    if product ne lag(product) or price ne lag(price) or dif(date) gt 1 then n+1;
run;

proc summary data=temp nway;
    class n Product Price;
    var date;
    output out=want(keep=Product Price Start_date End_date) min=Start_date max=End_date;
run;

The DATA to DATA Step Macro
Blog: SASnrd

Kurt_Bremser · Posted 08-16-2019 07:39 AM

I do not see a proc sort here:

@Srigyan wrote:

I have following timeseries data

Product	DATE	price
A	01-Jan-18	10
A	02-Jan-18	10
A	03-Jan-18	10
A	04-Jan-18	10
A	05-Jan-18	10
A	06-Jan-18	10
A	07-Jan-18	3
A	08-Jan-18	11
A	10-Jan-18	11
B	15-Jan-18	15
B	16-Jan-17	15
B	17-Jan-17	15
B	20-Jan-17	15
B	21-Jan-17	19
B	22-Jan-17	19
B	23-Jan-17	19
B	24-Jan-17	18

Code

data Table1;

input product $ date date9.  price;
format date date9.;
datalines;
A 01-Jan-18 10
A 02-Jan-18 10
A 03-Jan-18 10
A 04-Jan-18 10
A 05-Jan-18 10
A 06-Jan-18 10
A 07-Jan-18 3
A 08-Jan-18 11
A 10-Jan-18 11
B 15-Jan-18 15
B 16-Jan-17 15
B 17-Jan-17 15
B 20-Jan-17 15
B 21-Jan-17 19
B 22-Jan-17 19
B 23-Jan-17 19
B 24-Jan-17 18
;

run;

I wanted an ouput like this

Product	Start_date	End_date	Price
A	01-Jan-18	06-Jan-18	10
A	07-Jan-18	07-Jan-18	3
A	08-Jan-18	08-Jan-18	11
A	10-Jan-18	10-Jan-18	11
B	15-Jan-18	17-Jan-17	15
B	20-Jan-17	20-Jan-17	15
B	21-Jan-17	23-Jan-17	19
B	24-Jan-17	24-Jan-17	18

I am running following code


data temp;
set table1;
by product price notsorted;
retain grp;
k=dif(date);
if first.product then grp=1;
else if k ne 1 or first.price then grp+1;
drop k;
run;

proc sql;
create table want as
select distinct product,min(date) as  Start_date	format=date9.,max(date) as End_date format=date9., price
from temp
group by product, grp
order by product, start_date;
quit;

My table1 had more than 80m data, my code run fine and give the output I am looking for but it is taking enormous amount of time.

more than 7 & 8 hours.

Is there any way I can optimise it.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Astounding · Posted 08-16-2019 07:43 AM

Get rid of SQL. It won't take advantage of the fact that the data is sorted. Faster:

proc summary data=temp;
by product grp notsorted;
var date;
output out=want (drop=_freq_ _type_ grp)
min=start_date max=end_date;
run;

Faster still: it would require a few revisions to the DATA step to make it produce WANT. There are a few tricky aspects to that, however.

andreas_lds · Posted 08-16-2019 07:50 AM

It has already been mentioned: if you talk about something being slow, posting full log is required to see where the problem is.

And, as has been said, too: you don't need proc sql to solve the problem a data step is all you need, is all you need ...

data want;
   set table1(rename=(Price = _Price));
   by product;

   length Start_date End_date Price 8;
   format Start_date End_date date9.;
   retain Start_date Price;

   if first.product then do;
      Start_date = DAte;
      price = _price;
   end;

   _last = lag(Date);

   if price ^= _price then do;
      End_Date = _last;
      output;
      Start_Date = Date;
      price = _price;
   end;

   if not missing(_last) and _last+1 < Date then do;
      End_date = _last;
      output;
      Start_date = date;
   end;

   if last.product then do;
      End_date = date;
      output;
   end;

   drop _:;
run;

happy weekend

Astounding · Posted 08-16-2019 09:06 AM

@andreas_lds you're on the right path.

You would probably find it easier to use this BY statement:

by product price notsorted;

Also note, guard against outputting the same observation twice. For example, what would happen if the PRICE changes on the last observation for a PRODUCT?

andreas_lds · Posted 08-16-2019 09:25 AM

@Astounding wrote:

@andreas_lds you're on the right path.

You would probably find it easier to use this BY statement:

by product price notsorted;

Also note, guard against outputting the same observation twice. For example, what would happen if the PRICE changes on the last observation for a PRODUCT?

thanks for your input, i thought about using the by-statement you mention, but what requires "notsorted" that one price appears only once in the data? Could check this myself, of course, but i am already late for leaving office 😉

novinosrin · Posted 08-16-2019 09:23 AM

Too late to the party. Hmm, Just 9:22 EDT the day seems gone. Anyways, here is my attempt

data Table1;
input product $ date date9.  price;
format date date9.;
datalines;
A 01-Jan-18 10
A 02-Jan-18 10
A 03-Jan-18 10
A 04-Jan-18 10
A 05-Jan-18 10
A 06-Jan-18 10
A 07-Jan-18 3
A 08-Jan-18 11
A 10-Jan-18 11
B 15-Jan-17 15
B 16-Jan-17 15
B 17-Jan-17 15
B 20-Jan-17 15
B 21-Jan-17 19
B 22-Jan-17 19
B 23-Jan-17 19
B 24-Jan-17 18
;


data want ;
 length product $8 Start_date End_date price 8;/*Just to maintain order of variables*/
 if _n_=1 then do;
    dcl hash H (dataset:'Table1',multidata:'y') ;
    h.definekey  ("product","price","date") ;
    h.definedone () ;
 end;
 set Table1;
 by product price notsorted date ;
 if first.date;
 do date=date by 1 while(h.check()=0);
  Start_date=min(Start_date	,date);
  End_date=max(End_date,date);
  rc=h.remove();
 end;
 if Start_date>. and End_date>. then output;
 keep product Start_date End_date price;
 format start_date end_date date9.;
run;

Srigyan · Posted 08-27-2019 11:53 AM

Thanks for putting your effort. I could have chose only one code as solution.

novinosrin · Posted 08-27-2019 11:59 AM

@Srigyan You're very welcome. It's a buffet to choose from. I am here to learn as much as you do. I am hoping to get as good as some whom I dearly follow and I hope that day isn't too far away. Have a good one. Cheers!

novinosrin · Posted 08-16-2019 10:49 AM

Temp array is convenient too

data Table1;
input product $ date date9.  price;
format date date9.;
datalines;
A 01-Jan-18 10
A 02-Jan-18 10
A 03-Jan-18 10
A 04-Jan-18 10
A 05-Jan-18 10
A 06-Jan-18 10
A 07-Jan-18 3
A 08-Jan-18 11
A 10-Jan-18 11
B 15-Jan-17 15
B 16-Jan-17 15
B 17-Jan-17 15
B 20-Jan-17 15
B 21-Jan-17 19
B 22-Jan-17 19
B 23-Jan-17 19
B 24-Jan-17 18
;

data want;
 length product $8 Start_date End_date price 8;/*Just to maintain order of variables*/
 do _n_=1 by 1 until(last.price);
  set table1;
  by product price notsorted date ;
  array t(99999) _temporary_;
  t(_n_)=date; 
 end;
 do _n_=1 to _n_;
  set table1;
  if date not in t then continue;
  do date= date by 1 while(date in t);
   Start_date=min(Start_date	,date);
   End_date=max(End_date,date);
   call missing(t(whichn(date,of t(*))));
  end;
  output;
  call missing(Start_date,end_date);
 end;
 call missing(of t(*));
 drop date;
 format start_date end_date date9.;
run;

Tom · Posted 08-16-2019 11:14 AM

So you want to collapse periods of time where the PRICE doesn't change?

Make sure the data is sorted by product and date. The use BY product price NOTSORTED and SAS will automatically detect the groups.

You could find the start and stop date yourself.

data want ;
  do until (last.price);
    set table1;
    by product price notsorted;
    if first.price then start_date=date;
  end;
  end_date=date;
  drop date;
  format start_date end_date date9.;
run;

Or let PROC SUMMARY do it for you.

proc summary data=table1 nway;
  by producet price notsorted;
  var date ;
  output out=want(drop=_type_ _freq_) min=start_date max=end_date;
run;

Try both and see which is faster. I would expect PROC SUMMARY might take less wall clock time since it should be able to run the groups in parallel, but perhaps the NOTSORTED option will prevent that.

Registration is open

SAS Training: Just a Click Away