Solved: Diff in values on lagged dates

cmemtsa · Posted 01-25-2019 02:45 AM

Hi!

I have this dataset

DATE CF AMOUNT

31-Oct-18 A1 1

31-Oct-18 A2 2

31-Oct-18 A3 3

30-Nov-18 A1 5

30-Nov-18 A2 5

30-Nov-18 A3 5

31-Dec-18 A1 10

31-Dec-18 A2 20

31-Dec-18 A3 30

and for each CF I would like to create a new table with differences in amounts between months i.e. the output

DATE CF AMOUNT

30-Nov-18 A1 4

30-Nov-18 A2 3

30-Nov-18 A3 2

31-Dec-18 A1 5

31-Dec-18 A2 15

31-Dec-18 A3 25

How can I do it with a proq sql?

Thanks!

andreas_lds · Posted 01-25-2019 03:19 AM

@Kurt_Bremser : You don't need lag, but dif

data work.want;
   set work.have;
   by cf;
   
   amount = dif(amount);
   
   if not first.cf;
run;

View solution in original post

andreas_lds · Posted 01-25-2019 02:51 AM

@cmemtsa wrote:

Hi!

I have this dataset

DATE                    CF                   AMOUNT

31-Oct-18           A1                  1

31-Oct-18            A2                  2

31-Oct-18          A3                  3

30-Nov-18           A1                  5

30-Nov-18           A2                  5

30-Nov-18           A3                  5

31-Dec-18           A1                  10

31-Dec-18          A2                  20

31-Dec-18          A3                  30

and for each CF I would like to create a new table with differences in amounts between months i.e. the output

DATE                    CF                AMOUNT

30-Nov-18           A1                    4

30-Nov-18          A2                    3

30-Nov-18           A3                     2

31-Dec-18           A1                    5

31-Dec-18          A2                   15

31-Dec-18           A3                   25

How can I do it with a proq sql?

Thanks!

Why proc sql?

Kurt_Bremser · Posted 01-25-2019 03:13 AM

Comparisons with lagged data are best done in a data step (Maxim 14):

data have;
input date :date9. cf $ amount;
format date yymmddd10.;
cards;
31-Oct-18            A1                  1
31-Oct-18            A2                  2
31-Oct-18            A3                  3
30-Nov-18           A1                  5
30-Nov-18           A2                  5
30-Nov-18           A3                  5
31-Dec-18           A1                  10
31-Dec-18           A2                  20
31-Dec-18           A3                  30
;
run;

proc sort data=have;
by cf date;
run;

data want;
set have;
by cf;
_amount = lag(amount);
if not first.cf
then do;
  amount = amount - _amount;
  output;
end;
drop _amount;
run;

proc sort data=want;
by date cf;
run;

proc print data=want noobs;
run;

Result:

      date    cf    amount

2018-11-30    A1       4  
2018-11-30    A2       3  
2018-11-30    A3       2  
2018-12-31    A1       5  
2018-12-31    A2      15  
2018-12-31    A3      25

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

andreas_lds · Posted 01-25-2019 03:19 AM

@Kurt_Bremser : You don't need lag, but dif

data work.want;
   set work.have;
   by cf;
   
   amount = dif(amount);
   
   if not first.cf;
run;

FreelanceReinh · Posted 01-25-2019 03:51 AM

@cmemtsa wrote:
How can I do it with a proq sql?

Hi @cmemtsa,

You can do it by using a self join:

proc sql;
create table want as
select a.date, a.cf, a.amount-b.amount as amount
from have a, have b
where a.cf=b.cf & intck('month',b.date,a.date)=1
order by date, cf;
quit;

Kurt_Bremser · Posted 01-25-2019 04:49 AM

While @FreelanceReinh shows nicely how to do it in SQL, this can also be used as a nice illustration why one does not do it with SQL:

/* create some random test data of sufficient size */
data have;
format date yymmddd10.;
length cf $6;
date = '01jan2010'd;
do until (date > '31dec2019'd);
  do i = 1 to 999;
    cf = 'A' !! put(i,z5.);
    amount = int(rand('uniform') * 10);
    output;
  end;
  date = intnx('month',date,1,'b');
end;
drop i;
run;

/* SQL method */
proc sql;
create table want1 as
select a.date, a.cf, a.amount-b.amount as amount
from have a, have b
where a.cf=b.cf & intck('month',b.date,a.date)=1
order by date, cf;
quit;

/* data step method */
proc sort data=have;
by cf date;
run;

data want2;
set have;
by cf;
amount = dif(amount);
if not first.cf;
run;

proc sort data=want2;
by date cf;
run;

The log:

37         /* create some random test data of sufficient size */
38         data have;
39         format date yymmddd10.;
40         length cf $6;
41         date = '01jan2010'd;
42         do until (date > '31dec2019'd);
43           do i = 1 to 999;
44             cf = 'A' !! put(i,z5.);
45             amount = int(rand('uniform') * 10);
46             output;
47           end;
48           date = intnx('month',date,1,'b');
49         end;
50         drop i;
51         run;

NOTE: The data set WORK.HAVE has 119880 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.05 seconds
      cpu time            0.02 seconds


52         
53         /* SQL method */
54         proc sql;
55         create table want1 as
56         select a.date, a.cf, a.amount-b.amount as amount
57         from have a, have b
58         where a.cf=b.cf & intck('month',b.date,a.date)=1
59         order by date, cf;
NOTE: Table WORK.WANT1 created, with 118881 rows and 3 columns.

60         quit;
NOTE: PROZEDUR SQL used (Total process time):
      real time           11.56 seconds
      cpu time            3.88 seconds
      

61         
62         /* data step method */
63         proc sort data=have;
64         by cf date;
65         run;

NOTE: There were 119880 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.HAVE has 119880 observations and 3 variables.
NOTE: PROZEDUR SORT used (Total process time):
      real time           0.12 seconds
      cpu time            0.03 seconds
      

66         
67         data want2;
68         set have;
69         by cf;
70         amount = dif(amount);
71         if not first.cf;
72         run;

NOTE: There were 119880 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT2 has 118881 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.07 seconds
      cpu time            0.02 seconds
      

73         
74         proc sort data=want2;
75         by date cf;
76         run;

NOTE: There were 118881 observations read from the data set WORK.WANT2.
NOTE: The data set WORK.WANT2 has 118881 observations and 3 variables.
NOTE: PROZEDUR SORT used (Total process time):
      real time           0.09 seconds
      cpu time            0.03 seconds

Note that while the combined 2 sorts and the data step consumed ~ 0.3 seconds, the SQL took close to 12(!) seconds, a multiple of 40!

Things like this gave me the reason to write down Maxim 10.

Also note that this dataset have is just 3MB(!!) in size; imagine running the SQL against a typical dataset that's in the range of several GBs, you would have a runtime counted in days or even weeks (because of the required sort and search operations, the time needed grows exponentially); the utility file built by the SQL is double the size of the input dataset, BTW.

PS thanks to @andreas_lds for the improved data step code.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ksharp · Posted 01-25-2019 04:57 AM

Why do you have to choose SQL ? not suitable tool for this scenario.

data have;
input date :date9. cf $ amount;
format date yymmddd10.;
cards;
31-Oct-18            A1                  1
31-Oct-18            A2                  2
31-Oct-18            A3                  3
30-Nov-18           A1                  5
30-Nov-18           A2                  5
30-Nov-18           A3                  5
31-Dec-18           A1                  10
31-Dec-18           A2                  20
31-Dec-18           A3                  30
;
run;
proc sql;
create table want as
 select *,(select amount from have where date=intnx('month',a.date,-1,'e') and cf=a.cf) as _amount,
 amount-calculated _amount as want
  from have as a;
quit;

Kurt_Bremser · Posted 01-25-2019 06:46 AM

@Ksharp wrote:

proc sql;
create table want as
 select *,(select amount from have where date=intnx('month',a.date,-1,'e') and cf=a.cf) as _amount,
 amount-calculated _amount as want
  from have as a;
quit;

This code, run against my "large" dataset from my example, took 12 minutes. The SAS SQL compiler seems to have a severe problem optimizing this query.

I had to change 'e' in the intnx to 'b' because of the way I set up the example data.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ksharp · Posted 01-25-2019 07:51 AM

Totally agree. That is the reason why I prefer to data step .

Kurt,

If data organize well just like example.

want = dif3(amount);

is good enough.

Diff in values on lagged dates

Re: Diff in values on lagged dates

Re: Diff in values on lagged dates

Re: Diff in values on lagged dates

Re: Diff in values on lagged dates

Re: Diff in values on lagged dates

Re: Diff in values on lagged dates

Re: Diff in values on lagged dates

Re: Diff in values on lagged dates

Re: Diff in values on lagged dates