BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
cmemtsa
Quartz | Level 8

Hi!

 

I have this dataset

 

DATE                    CF                   AMOUNT

31-Oct-18            A1                  1

31-Oct-18            A2                  2

31-Oct-18            A3                  3

30-Nov-18           A1                  5

30-Nov-18           A2                  5

30-Nov-18           A3                  5

31-Dec-18           A1                  10

31-Dec-18           A2                  20

31-Dec-18           A3                  30

 

and for each CF I would like to create a new table with differences in amounts between months i.e. the output

DATE                    CF                AMOUNT

30-Nov-18           A1                     4

30-Nov-18           A2                     3

30-Nov-18           A3                     2

31-Dec-18           A1                    5

31-Dec-18           A2                   15

31-Dec-18           A3                    25

 

How can I do it with a proq sql?

 

Thanks!

 

1 ACCEPTED SOLUTION

Accepted Solutions
andreas_lds
Jade | Level 19

@Kurt_Bremser : You don't need lag, but dif

 

data work.want;
   set work.have;
   by cf;
   
   amount = dif(amount);
   
   if not first.cf;
run;

View solution in original post

8 REPLIES 8
andreas_lds
Jade | Level 19

@cmemtsa wrote:

Hi!

 

I have this dataset

 

DATE                    CF                   AMOUNT

31-Oct-18            A1                  1

31-Oct-18            A2                  2

31-Oct-18            A3                  3

30-Nov-18           A1                  5

30-Nov-18           A2                  5

30-Nov-18           A3                  5

31-Dec-18           A1                  10

31-Dec-18           A2                  20

31-Dec-18           A3                  30

 

and for each CF I would like to create a new table with differences in amounts between months i.e. the output

DATE                    CF                AMOUNT

30-Nov-18           A1                     4

30-Nov-18           A2                     3

30-Nov-18           A3                     2

31-Dec-18           A1                    5

31-Dec-18           A2                   15

31-Dec-18           A3                    25

 

How can I do it with a proq sql?

 

Thanks!

 


 

Why proc sql?

Kurt_Bremser
Super User

Comparisons with lagged data are best done in a data step (Maxim 14):

data have;
input date :date9. cf $ amount;
format date yymmddd10.;
cards;
31-Oct-18            A1                  1
31-Oct-18            A2                  2
31-Oct-18            A3                  3
30-Nov-18           A1                  5
30-Nov-18           A2                  5
30-Nov-18           A3                  5
31-Dec-18           A1                  10
31-Dec-18           A2                  20
31-Dec-18           A3                  30
;
run;

proc sort data=have;
by cf date;
run;

data want;
set have;
by cf;
_amount = lag(amount);
if not first.cf
then do;
  amount = amount - _amount;
  output;
end;
drop _amount;
run;

proc sort data=want;
by date cf;
run;

proc print data=want noobs;
run;

Result:

      date    cf    amount

2018-11-30    A1       4  
2018-11-30    A2       3  
2018-11-30    A3       2  
2018-12-31    A1       5  
2018-12-31    A2      15  
2018-12-31    A3      25  

 

andreas_lds
Jade | Level 19

@Kurt_Bremser : You don't need lag, but dif

 

data work.want;
   set work.have;
   by cf;
   
   amount = dif(amount);
   
   if not first.cf;
run;
FreelanceReinh
Jade | Level 19

@cmemtsa wrote:

How can I do it with a proq sql?


Hi @cmemtsa,

 

You can do it by using a self join:

proc sql;
create table want as
select a.date, a.cf, a.amount-b.amount as amount
from have a, have b
where a.cf=b.cf & intck('month',b.date,a.date)=1
order by date, cf;
quit;
Kurt_Bremser
Super User

While @FreelanceReinh shows nicely how to do it in SQL, this can also be used as a nice illustration why one does not do it with SQL:

/* create some random test data of sufficient size */
data have;
format date yymmddd10.;
length cf $6;
date = '01jan2010'd;
do until (date > '31dec2019'd);
  do i = 1 to 999;
    cf = 'A' !! put(i,z5.);
    amount = int(rand('uniform') * 10);
    output;
  end;
  date = intnx('month',date,1,'b');
end;
drop i;
run;

/* SQL method */
proc sql;
create table want1 as
select a.date, a.cf, a.amount-b.amount as amount
from have a, have b
where a.cf=b.cf & intck('month',b.date,a.date)=1
order by date, cf;
quit;

/* data step method */
proc sort data=have;
by cf date;
run;

data want2;
set have;
by cf;
amount = dif(amount);
if not first.cf;
run;

proc sort data=want2;
by date cf;
run;

The log:

37         /* create some random test data of sufficient size */
38         data have;
39         format date yymmddd10.;
40         length cf $6;
41         date = '01jan2010'd;
42         do until (date > '31dec2019'd);
43           do i = 1 to 999;
44             cf = 'A' !! put(i,z5.);
45             amount = int(rand('uniform') * 10);
46             output;
47           end;
48           date = intnx('month',date,1,'b');
49         end;
50         drop i;
51         run;

NOTE: The data set WORK.HAVE has 119880 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.05 seconds
      cpu time            0.02 seconds


52         
53         /* SQL method */
54         proc sql;
55         create table want1 as
56         select a.date, a.cf, a.amount-b.amount as amount
57         from have a, have b
58         where a.cf=b.cf & intck('month',b.date,a.date)=1
59         order by date, cf;
NOTE: Table WORK.WANT1 created, with 118881 rows and 3 columns.

60         quit;
NOTE: PROZEDUR SQL used (Total process time):
      real time           11.56 seconds
      cpu time            3.88 seconds
      

61         
62         /* data step method */
63         proc sort data=have;
64         by cf date;
65         run;

NOTE: There were 119880 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.HAVE has 119880 observations and 3 variables.
NOTE: PROZEDUR SORT used (Total process time):
      real time           0.12 seconds
      cpu time            0.03 seconds
      

66         
67         data want2;
68         set have;
69         by cf;
70         amount = dif(amount);
71         if not first.cf;
72         run;

NOTE: There were 119880 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT2 has 118881 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.07 seconds
      cpu time            0.02 seconds
      

73         
74         proc sort data=want2;
75         by date cf;
76         run;

NOTE: There were 118881 observations read from the data set WORK.WANT2.
NOTE: The data set WORK.WANT2 has 118881 observations and 3 variables.
NOTE: PROZEDUR SORT used (Total process time):
      real time           0.09 seconds
      cpu time            0.03 seconds

Note that while the combined 2 sorts and the data step consumed ~ 0.3 seconds, the SQL took close to 12(!) seconds, a multiple of 40!

Things like this gave me the reason to write down Maxim 10.

Also note that this dataset have is just 3MB(!!) in size; imagine running the SQL against a typical dataset that's in the range of several GBs, you would have a runtime counted in days or even weeks (because of the required sort and search operations, the time needed grows exponentially); the utility file built by the SQL is double the size of the input dataset, BTW.

 

 

PS thanks to @andreas_lds for the improved data step code.

 

Ksharp
Super User

Why do you have to choose SQL ? not suitable tool for this scenario.

 

data have;
input date :date9. cf $ amount;
format date yymmddd10.;
cards;
31-Oct-18            A1                  1
31-Oct-18            A2                  2
31-Oct-18            A3                  3
30-Nov-18           A1                  5
30-Nov-18           A2                  5
30-Nov-18           A3                  5
31-Dec-18           A1                  10
31-Dec-18           A2                  20
31-Dec-18           A3                  30
;
run;
proc sql;
create table want as
 select *,(select amount from have where date=intnx('month',a.date,-1,'e') and cf=a.cf) as _amount,
 amount-calculated _amount as want
  from have as a;
quit;
Kurt_Bremser
Super User

@Ksharp wrote:
proc sql;
create table want as
 select *,(select amount from have where date=intnx('month',a.date,-1,'e') and cf=a.cf) as _amount,
 amount-calculated _amount as want
  from have as a;
quit;

This code, run against my "large" dataset from my example, took 12 minutes. The SAS SQL compiler seems to have a severe problem optimizing this query.

I had to change 'e' in the intnx to 'b' because of the way I set up the example data.

Ksharp
Super User

Totally agree. That is the reason why I prefer to data step .

 

Kurt,

If data organize well just like example.

 

want = dif3(amount);

is good enough. 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 8 replies
  • 825 views
  • 7 likes
  • 5 in conversation