Hi!
I have this dataset
DATE CF AMOUNT
31-Oct-18 A1 1
31-Oct-18 A2 2
31-Oct-18 A3 3
30-Nov-18 A1 5
30-Nov-18 A2 5
30-Nov-18 A3 5
31-Dec-18 A1 10
31-Dec-18 A2 20
31-Dec-18 A3 30
and for each CF I would like to create a new table with differences in amounts between months i.e. the output
DATE CF AMOUNT
30-Nov-18 A1 4
30-Nov-18 A2 3
30-Nov-18 A3 2
31-Dec-18 A1 5
31-Dec-18 A2 15
31-Dec-18 A3 25
How can I do it with a proq sql?
Thanks!
@Kurt_Bremser : You don't need lag, but dif
data work.want;
set work.have;
by cf;
amount = dif(amount);
if not first.cf;
run;
@cmemtsa wrote:
Hi!
I have this dataset
DATE CF AMOUNT
31-Oct-18 A1 1
31-Oct-18 A2 2
31-Oct-18 A3 3
30-Nov-18 A1 5
30-Nov-18 A2 5
30-Nov-18 A3 5
31-Dec-18 A1 10
31-Dec-18 A2 20
31-Dec-18 A3 30
and for each CF I would like to create a new table with differences in amounts between months i.e. the output
DATE CF AMOUNT
30-Nov-18 A1 4
30-Nov-18 A2 3
30-Nov-18 A3 2
31-Dec-18 A1 5
31-Dec-18 A2 15
31-Dec-18 A3 25
How can I do it with a proq sql?
Thanks!
Why proc sql?
Comparisons with lagged data are best done in a data step (Maxim 14):
data have;
input date :date9. cf $ amount;
format date yymmddd10.;
cards;
31-Oct-18 A1 1
31-Oct-18 A2 2
31-Oct-18 A3 3
30-Nov-18 A1 5
30-Nov-18 A2 5
30-Nov-18 A3 5
31-Dec-18 A1 10
31-Dec-18 A2 20
31-Dec-18 A3 30
;
run;
proc sort data=have;
by cf date;
run;
data want;
set have;
by cf;
_amount = lag(amount);
if not first.cf
then do;
amount = amount - _amount;
output;
end;
drop _amount;
run;
proc sort data=want;
by date cf;
run;
proc print data=want noobs;
run;
Result:
date cf amount 2018-11-30 A1 4 2018-11-30 A2 3 2018-11-30 A3 2 2018-12-31 A1 5 2018-12-31 A2 15 2018-12-31 A3 25
@Kurt_Bremser : You don't need lag, but dif
data work.want;
set work.have;
by cf;
amount = dif(amount);
if not first.cf;
run;
While @FreelanceReinh shows nicely how to do it in SQL, this can also be used as a nice illustration why one does not do it with SQL:
/* create some random test data of sufficient size */
data have;
format date yymmddd10.;
length cf $6;
date = '01jan2010'd;
do until (date > '31dec2019'd);
do i = 1 to 999;
cf = 'A' !! put(i,z5.);
amount = int(rand('uniform') * 10);
output;
end;
date = intnx('month',date,1,'b');
end;
drop i;
run;
/* SQL method */
proc sql;
create table want1 as
select a.date, a.cf, a.amount-b.amount as amount
from have a, have b
where a.cf=b.cf & intck('month',b.date,a.date)=1
order by date, cf;
quit;
/* data step method */
proc sort data=have;
by cf date;
run;
data want2;
set have;
by cf;
amount = dif(amount);
if not first.cf;
run;
proc sort data=want2;
by date cf;
run;
The log:
37 /* create some random test data of sufficient size */ 38 data have; 39 format date yymmddd10.; 40 length cf $6; 41 date = '01jan2010'd; 42 do until (date > '31dec2019'd); 43 do i = 1 to 999; 44 cf = 'A' !! put(i,z5.); 45 amount = int(rand('uniform') * 10); 46 output; 47 end; 48 date = intnx('month',date,1,'b'); 49 end; 50 drop i; 51 run; NOTE: The data set WORK.HAVE has 119880 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.05 seconds cpu time 0.02 seconds 52 53 /* SQL method */ 54 proc sql; 55 create table want1 as 56 select a.date, a.cf, a.amount-b.amount as amount 57 from have a, have b 58 where a.cf=b.cf & intck('month',b.date,a.date)=1 59 order by date, cf; NOTE: Table WORK.WANT1 created, with 118881 rows and 3 columns. 60 quit; NOTE: PROZEDUR SQL used (Total process time): real time 11.56 seconds cpu time 3.88 seconds 61 62 /* data step method */ 63 proc sort data=have; 64 by cf date; 65 run; NOTE: There were 119880 observations read from the data set WORK.HAVE. NOTE: The data set WORK.HAVE has 119880 observations and 3 variables. NOTE: PROZEDUR SORT used (Total process time): real time 0.12 seconds cpu time 0.03 seconds 66 67 data want2; 68 set have; 69 by cf; 70 amount = dif(amount); 71 if not first.cf; 72 run; NOTE: There were 119880 observations read from the data set WORK.HAVE. NOTE: The data set WORK.WANT2 has 118881 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.07 seconds cpu time 0.02 seconds 73 74 proc sort data=want2; 75 by date cf; 76 run; NOTE: There were 118881 observations read from the data set WORK.WANT2. NOTE: The data set WORK.WANT2 has 118881 observations and 3 variables. NOTE: PROZEDUR SORT used (Total process time): real time 0.09 seconds cpu time 0.03 seconds
Note that while the combined 2 sorts and the data step consumed ~ 0.3 seconds, the SQL took close to 12(!) seconds, a multiple of 40!
Things like this gave me the reason to write down Maxim 10.
Also note that this dataset have is just 3MB(!!) in size; imagine running the SQL against a typical dataset that's in the range of several GBs, you would have a runtime counted in days or even weeks (because of the required sort and search operations, the time needed grows exponentially); the utility file built by the SQL is double the size of the input dataset, BTW.
PS thanks to @andreas_lds for the improved data step code.
Why do you have to choose SQL ? not suitable tool for this scenario.
data have;
input date :date9. cf $ amount;
format date yymmddd10.;
cards;
31-Oct-18 A1 1
31-Oct-18 A2 2
31-Oct-18 A3 3
30-Nov-18 A1 5
30-Nov-18 A2 5
30-Nov-18 A3 5
31-Dec-18 A1 10
31-Dec-18 A2 20
31-Dec-18 A3 30
;
run;
proc sql;
create table want as
select *,(select amount from have where date=intnx('month',a.date,-1,'e') and cf=a.cf) as _amount,
amount-calculated _amount as want
from have as a;
quit;
@Ksharp wrote:
proc sql; create table want as select *,(select amount from have where date=intnx('month',a.date,-1,'e') and cf=a.cf) as _amount, amount-calculated _amount as want from have as a; quit;
This code, run against my "large" dataset from my example, took 12 minutes. The SAS SQL compiler seems to have a severe problem optimizing this query.
I had to change 'e' in the intnx to 'b' because of the way I set up the example data.
Totally agree. That is the reason why I prefer to data step .
Kurt,
If data organize well just like example.
want = dif3(amount);
is good enough.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.