Hi!
I have this dataset
DATE CF AMOUNT
31-Oct-18 A1 1
31-Oct-18 A2 2
31-Oct-18 A3 3
30-Nov-18 A1 5
30-Nov-18 A2 5
30-Nov-18 A3 5
31-Dec-18 A1 10
31-Dec-18 A2 20
31-Dec-18 A3 30
and for each CF I would like to create a new table with differences in amounts between months i.e. the output
DATE CF AMOUNT
30-Nov-18 A1 4
30-Nov-18 A2 3
30-Nov-18 A3 2
31-Dec-18 A1 5
31-Dec-18 A2 15
31-Dec-18 A3 25
How can I do it with a proq sql?
Thanks!
@Kurt_Bremser : You don't need lag, but dif
data work.want;
set work.have;
by cf;
amount = dif(amount);
if not first.cf;
run;
@cmemtsa wrote:
Hi!
I have this dataset
DATE CF AMOUNT
31-Oct-18 A1 1
31-Oct-18 A2 2
31-Oct-18 A3 3
30-Nov-18 A1 5
30-Nov-18 A2 5
30-Nov-18 A3 5
31-Dec-18 A1 10
31-Dec-18 A2 20
31-Dec-18 A3 30
and for each CF I would like to create a new table with differences in amounts between months i.e. the output
DATE CF AMOUNT
30-Nov-18 A1 4
30-Nov-18 A2 3
30-Nov-18 A3 2
31-Dec-18 A1 5
31-Dec-18 A2 15
31-Dec-18 A3 25
How can I do it with a proq sql?
Thanks!
Why proc sql?
Comparisons with lagged data are best done in a data step (Maxim 14):
data have;
input date :date9. cf $ amount;
format date yymmddd10.;
cards;
31-Oct-18 A1 1
31-Oct-18 A2 2
31-Oct-18 A3 3
30-Nov-18 A1 5
30-Nov-18 A2 5
30-Nov-18 A3 5
31-Dec-18 A1 10
31-Dec-18 A2 20
31-Dec-18 A3 30
;
run;
proc sort data=have;
by cf date;
run;
data want;
set have;
by cf;
_amount = lag(amount);
if not first.cf
then do;
amount = amount - _amount;
output;
end;
drop _amount;
run;
proc sort data=want;
by date cf;
run;
proc print data=want noobs;
run;
Result:
date cf amount 2018-11-30 A1 4 2018-11-30 A2 3 2018-11-30 A3 2 2018-12-31 A1 5 2018-12-31 A2 15 2018-12-31 A3 25
@Kurt_Bremser : You don't need lag, but dif
data work.want;
set work.have;
by cf;
amount = dif(amount);
if not first.cf;
run;
While @FreelanceReinh shows nicely how to do it in SQL, this can also be used as a nice illustration why one does not do it with SQL:
/* create some random test data of sufficient size */
data have;
format date yymmddd10.;
length cf $6;
date = '01jan2010'd;
do until (date > '31dec2019'd);
do i = 1 to 999;
cf = 'A' !! put(i,z5.);
amount = int(rand('uniform') * 10);
output;
end;
date = intnx('month',date,1,'b');
end;
drop i;
run;
/* SQL method */
proc sql;
create table want1 as
select a.date, a.cf, a.amount-b.amount as amount
from have a, have b
where a.cf=b.cf & intck('month',b.date,a.date)=1
order by date, cf;
quit;
/* data step method */
proc sort data=have;
by cf date;
run;
data want2;
set have;
by cf;
amount = dif(amount);
if not first.cf;
run;
proc sort data=want2;
by date cf;
run;
The log:
37 /* create some random test data of sufficient size */
38 data have;
39 format date yymmddd10.;
40 length cf $6;
41 date = '01jan2010'd;
42 do until (date > '31dec2019'd);
43 do i = 1 to 999;
44 cf = 'A' !! put(i,z5.);
45 amount = int(rand('uniform') * 10);
46 output;
47 end;
48 date = intnx('month',date,1,'b');
49 end;
50 drop i;
51 run;
NOTE: The data set WORK.HAVE has 119880 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.05 seconds
cpu time 0.02 seconds
52
53 /* SQL method */
54 proc sql;
55 create table want1 as
56 select a.date, a.cf, a.amount-b.amount as amount
57 from have a, have b
58 where a.cf=b.cf & intck('month',b.date,a.date)=1
59 order by date, cf;
NOTE: Table WORK.WANT1 created, with 118881 rows and 3 columns.
60 quit;
NOTE: PROZEDUR SQL used (Total process time):
real time 11.56 seconds
cpu time 3.88 seconds
61
62 /* data step method */
63 proc sort data=have;
64 by cf date;
65 run;
NOTE: There were 119880 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.HAVE has 119880 observations and 3 variables.
NOTE: PROZEDUR SORT used (Total process time):
real time 0.12 seconds
cpu time 0.03 seconds
66
67 data want2;
68 set have;
69 by cf;
70 amount = dif(amount);
71 if not first.cf;
72 run;
NOTE: There were 119880 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT2 has 118881 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.07 seconds
cpu time 0.02 seconds
73
74 proc sort data=want2;
75 by date cf;
76 run;
NOTE: There were 118881 observations read from the data set WORK.WANT2.
NOTE: The data set WORK.WANT2 has 118881 observations and 3 variables.
NOTE: PROZEDUR SORT used (Total process time):
real time 0.09 seconds
cpu time 0.03 seconds
Note that while the combined 2 sorts and the data step consumed ~ 0.3 seconds, the SQL took close to 12(!) seconds, a multiple of 40!
Things like this gave me the reason to write down Maxim 10.
Also note that this dataset have is just 3MB(!!) in size; imagine running the SQL against a typical dataset that's in the range of several GBs, you would have a runtime counted in days or even weeks (because of the required sort and search operations, the time needed grows exponentially); the utility file built by the SQL is double the size of the input dataset, BTW.
PS thanks to @andreas_lds for the improved data step code.
Why do you have to choose SQL ? not suitable tool for this scenario.
data have;
input date :date9. cf $ amount;
format date yymmddd10.;
cards;
31-Oct-18 A1 1
31-Oct-18 A2 2
31-Oct-18 A3 3
30-Nov-18 A1 5
30-Nov-18 A2 5
30-Nov-18 A3 5
31-Dec-18 A1 10
31-Dec-18 A2 20
31-Dec-18 A3 30
;
run;
proc sql;
create table want as
select *,(select amount from have where date=intnx('month',a.date,-1,'e') and cf=a.cf) as _amount,
amount-calculated _amount as want
from have as a;
quit;
@Ksharp wrote:
proc sql; create table want as select *,(select amount from have where date=intnx('month',a.date,-1,'e') and cf=a.cf) as _amount, amount-calculated _amount as want from have as a; quit;
This code, run against my "large" dataset from my example, took 12 minutes. The SAS SQL compiler seems to have a severe problem optimizing this query.
I had to change 'e' in the intnx to 'b' because of the way I set up the example data.
Totally agree. That is the reason why I prefer to data step .
Kurt,
If data organize well just like example.
want = dif3(amount);
is good enough.
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.