BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
cmemtsa
Quartz | Level 8

Hi!

 

I have this dataset

 

DATE                    CF                   AMOUNT

31-Oct-18            A1                  1

31-Oct-18            A2                  2

31-Oct-18            A3                  3

30-Nov-18           A1                  5

30-Nov-18           A2                  5

30-Nov-18           A3                  5

31-Dec-18           A1                  10

31-Dec-18           A2                  20

31-Dec-18           A3                  30

 

and for each CF I would like to create a new table with differences in amounts between months i.e. the output

DATE                    CF                AMOUNT

30-Nov-18           A1                     4

30-Nov-18           A2                     3

30-Nov-18           A3                     2

31-Dec-18           A1                    5

31-Dec-18           A2                   15

31-Dec-18           A3                    25

 

How can I do it with a proq sql?

 

Thanks!

 

1 ACCEPTED SOLUTION

Accepted Solutions
andreas_lds
Jade | Level 19

@Kurt_Bremser : You don't need lag, but dif

 

data work.want;
   set work.have;
   by cf;
   
   amount = dif(amount);
   
   if not first.cf;
run;

View solution in original post

8 REPLIES 8
andreas_lds
Jade | Level 19

@cmemtsa wrote:

Hi!

 

I have this dataset

 

DATE                    CF                   AMOUNT

31-Oct-18            A1                  1

31-Oct-18            A2                  2

31-Oct-18            A3                  3

30-Nov-18           A1                  5

30-Nov-18           A2                  5

30-Nov-18           A3                  5

31-Dec-18           A1                  10

31-Dec-18           A2                  20

31-Dec-18           A3                  30

 

and for each CF I would like to create a new table with differences in amounts between months i.e. the output

DATE                    CF                AMOUNT

30-Nov-18           A1                     4

30-Nov-18           A2                     3

30-Nov-18           A3                     2

31-Dec-18           A1                    5

31-Dec-18           A2                   15

31-Dec-18           A3                    25

 

How can I do it with a proq sql?

 

Thanks!

 


 

Why proc sql?

Kurt_Bremser
Super User

Comparisons with lagged data are best done in a data step (Maxim 14):

data have;
input date :date9. cf $ amount;
format date yymmddd10.;
cards;
31-Oct-18            A1                  1
31-Oct-18            A2                  2
31-Oct-18            A3                  3
30-Nov-18           A1                  5
30-Nov-18           A2                  5
30-Nov-18           A3                  5
31-Dec-18           A1                  10
31-Dec-18           A2                  20
31-Dec-18           A3                  30
;
run;

proc sort data=have;
by cf date;
run;

data want;
set have;
by cf;
_amount = lag(amount);
if not first.cf
then do;
  amount = amount - _amount;
  output;
end;
drop _amount;
run;

proc sort data=want;
by date cf;
run;

proc print data=want noobs;
run;

Result:

      date    cf    amount

2018-11-30    A1       4  
2018-11-30    A2       3  
2018-11-30    A3       2  
2018-12-31    A1       5  
2018-12-31    A2      15  
2018-12-31    A3      25  

 

andreas_lds
Jade | Level 19

@Kurt_Bremser : You don't need lag, but dif

 

data work.want;
   set work.have;
   by cf;
   
   amount = dif(amount);
   
   if not first.cf;
run;
FreelanceReinh
Jade | Level 19

@cmemtsa wrote:

How can I do it with a proq sql?


Hi @cmemtsa,

 

You can do it by using a self join:

proc sql;
create table want as
select a.date, a.cf, a.amount-b.amount as amount
from have a, have b
where a.cf=b.cf & intck('month',b.date,a.date)=1
order by date, cf;
quit;
Kurt_Bremser
Super User

While @FreelanceReinh shows nicely how to do it in SQL, this can also be used as a nice illustration why one does not do it with SQL:

/* create some random test data of sufficient size */
data have;
format date yymmddd10.;
length cf $6;
date = '01jan2010'd;
do until (date > '31dec2019'd);
  do i = 1 to 999;
    cf = 'A' !! put(i,z5.);
    amount = int(rand('uniform') * 10);
    output;
  end;
  date = intnx('month',date,1,'b');
end;
drop i;
run;

/* SQL method */
proc sql;
create table want1 as
select a.date, a.cf, a.amount-b.amount as amount
from have a, have b
where a.cf=b.cf & intck('month',b.date,a.date)=1
order by date, cf;
quit;

/* data step method */
proc sort data=have;
by cf date;
run;

data want2;
set have;
by cf;
amount = dif(amount);
if not first.cf;
run;

proc sort data=want2;
by date cf;
run;

The log:

37         /* create some random test data of sufficient size */
38         data have;
39         format date yymmddd10.;
40         length cf $6;
41         date = '01jan2010'd;
42         do until (date > '31dec2019'd);
43           do i = 1 to 999;
44             cf = 'A' !! put(i,z5.);
45             amount = int(rand('uniform') * 10);
46             output;
47           end;
48           date = intnx('month',date,1,'b');
49         end;
50         drop i;
51         run;

NOTE: The data set WORK.HAVE has 119880 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.05 seconds
      cpu time            0.02 seconds


52         
53         /* SQL method */
54         proc sql;
55         create table want1 as
56         select a.date, a.cf, a.amount-b.amount as amount
57         from have a, have b
58         where a.cf=b.cf & intck('month',b.date,a.date)=1
59         order by date, cf;
NOTE: Table WORK.WANT1 created, with 118881 rows and 3 columns.

60         quit;
NOTE: PROZEDUR SQL used (Total process time):
      real time           11.56 seconds
      cpu time            3.88 seconds
      

61         
62         /* data step method */
63         proc sort data=have;
64         by cf date;
65         run;

NOTE: There were 119880 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.HAVE has 119880 observations and 3 variables.
NOTE: PROZEDUR SORT used (Total process time):
      real time           0.12 seconds
      cpu time            0.03 seconds
      

66         
67         data want2;
68         set have;
69         by cf;
70         amount = dif(amount);
71         if not first.cf;
72         run;

NOTE: There were 119880 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT2 has 118881 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.07 seconds
      cpu time            0.02 seconds
      

73         
74         proc sort data=want2;
75         by date cf;
76         run;

NOTE: There were 118881 observations read from the data set WORK.WANT2.
NOTE: The data set WORK.WANT2 has 118881 observations and 3 variables.
NOTE: PROZEDUR SORT used (Total process time):
      real time           0.09 seconds
      cpu time            0.03 seconds

Note that while the combined 2 sorts and the data step consumed ~ 0.3 seconds, the SQL took close to 12(!) seconds, a multiple of 40!

Things like this gave me the reason to write down Maxim 10.

Also note that this dataset have is just 3MB(!!) in size; imagine running the SQL against a typical dataset that's in the range of several GBs, you would have a runtime counted in days or even weeks (because of the required sort and search operations, the time needed grows exponentially); the utility file built by the SQL is double the size of the input dataset, BTW.

 

 

PS thanks to @andreas_lds for the improved data step code.

 

Ksharp
Super User

Why do you have to choose SQL ? not suitable tool for this scenario.

 

data have;
input date :date9. cf $ amount;
format date yymmddd10.;
cards;
31-Oct-18            A1                  1
31-Oct-18            A2                  2
31-Oct-18            A3                  3
30-Nov-18           A1                  5
30-Nov-18           A2                  5
30-Nov-18           A3                  5
31-Dec-18           A1                  10
31-Dec-18           A2                  20
31-Dec-18           A3                  30
;
run;
proc sql;
create table want as
 select *,(select amount from have where date=intnx('month',a.date,-1,'e') and cf=a.cf) as _amount,
 amount-calculated _amount as want
  from have as a;
quit;
Kurt_Bremser
Super User

@Ksharp wrote:
proc sql;
create table want as
 select *,(select amount from have where date=intnx('month',a.date,-1,'e') and cf=a.cf) as _amount,
 amount-calculated _amount as want
  from have as a;
quit;

This code, run against my "large" dataset from my example, took 12 minutes. The SAS SQL compiler seems to have a severe problem optimizing this query.

I had to change 'e' in the intnx to 'b' because of the way I set up the example data.

Ksharp
Super User

Totally agree. That is the reason why I prefer to data step .

 

Kurt,

If data organize well just like example.

 

want = dif3(amount);

is good enough.