Hi,
I have a dataset on which I want to calculate the recursive empirical CDF in a following way:
- first 4 periods are treated as non-recursive - I calculate "classic" CDF based on the the four observations
- for the period 5 I start recursive calculations that take into account all the data available up to each period, but without changing previously calculated values.
Below is the example showing the idea. How to automate this? In real life I have a few thousand periods. Should I put it into the macro that will loop through observations, every time taking one more? Or there is a more efficient way?
data have;
input period x;
datalines;
1 10
2 5
3 2
4 8
5 1
6 3
7 11
8 14
9 20
10 5
;
run;
/* Empirical CDF based on the first 4 observations (i.e. the pre-recursive period)*/
proc sort data=have (where=(period<5)) out=pre_rec;
by x;
run;
proc sql noprint;
select count(*) into :N
from pre_rec;
quit;
data pre_rec;
set pre_rec;
cdf_t=_n_/&N;
drop x;
run;
/* Calculations for the 5th period:
- include all the data available up to period 5
- calculate CDF but only keep the result for period 5 (as CDFs for periods 1-4 have already been calculated, I don't want to change them*/
data period_5;
set have;
where period<6;
run;
proc sort data=period_5;
by x;
run;
proc sql noprint;
select count(*) into :N
from period_5;
quit;
data period_5;
set period_5;
where period=5; /* keep only the CDF for period 5*/
cdf_t=_n_/&N;
drop x;
run;
/* Calculations for the 6th period:
- include all the data available up to period 6
- calculate CDF but only keep the result for period 6 (as CDFs for periods 1-5 have already been calculated, I don't want to change them*/
data period_6;
set have;
where period<7;
run;
proc sort data=period_6;
by x;
run;
proc sql noprint;
select count(*) into :N
from period_6;
quit;
data period_6;
set period_6;
where period=6; /* keep only the CDF for period 6 - for the previous periods I keep the CDFs calculated before*/
cdf_t=_n_/&N;
drop x;
run;
/* Etc for other periods*/
/* Final dataset:
- First 4 values calculated on the pre-recursive period
- 5th value calculated on the basis of first 5 observations
- 6th value calculated on the basis of first 6 observations
- ...
*/
data want;
set pre_rec period_5 period_6;
run;
First, I don't think "recursive" is the right word here. Nevertheless, this could be turned into a macro like this:
%macro dothis(startperiod=,endperiod=);
%do i=&startperiod %to &endperiod;
data period_&i;
set _temp_(where=(period<=&i));
run;
data period_&i;
set period_&i;
cdf_t=_n_/&i;
n=_n_;
%if &i^=&startperiod %then where period=&i%str(;);
drop x;
run;
proc append base=all_data1 new=period_&i;
run;
%end;
%mend;
proc sort data=have out=_temp_;
by x;
run;
%dothis(startperiod=4,endperiod=10)
First, I don't think "recursive" is the right word here. Nevertheless, this could be turned into a macro like this:
%macro dothis(startperiod=,endperiod=);
%do i=&startperiod %to &endperiod;
data period_&i;
set _temp_(where=(period<=&i));
run;
data period_&i;
set period_&i;
cdf_t=_n_/&i;
n=_n_;
%if &i^=&startperiod %then where period=&i%str(;);
drop x;
run;
proc append base=all_data1 new=period_&i;
run;
%end;
%mend;
proc sort data=have out=_temp_;
by x;
run;
%dothis(startperiod=4,endperiod=10)
Moving or rolling CDF is a better term.
@PaigeMiller wrote:
First, I don't think "recursive" is the right word here. Nevertheless, this could be turned into a macro like this:
%macro dothis(startperiod=,endperiod=); %do i=&startperiod %to &endperiod; data period_&i; set _temp_(where=(period<=&i)); run; data period_&i; set period_&i; cdf_t=_n_/&i; n=_n_; %if &i^=&startperiod %then where period=&i%str(;); drop x; run; proc append base=all_data1 new=period_&i; run; %end; %mend; proc sort data=have out=_temp_; by x; run; %dothis(startperiod=4,endperiod=10)
Many thanks!
You are using a WHERE statement to filter observations in the data step calculating cdf_t. While WHERE will filter the observations, it will NOT increment _N_. That's because the WHERE filter always outsources its work to the data engine (e..g sql, sas, excelms, oracle, etc.), and the data step never sees non-qualifying records (and therefore never iterates, meaning no increment of _N_).
So when you calculate CDF_T=_n_/&N
you are really generating CDF_T = 1/&N
If you want to allow _N_ to increment, you have to use the subsetting IF statement instead of the WHERE statement.
Do you mean this part of my code?
data period_5;
set period_5;
where period=5; /* keep only the CDF for period 5*/
cdf_t=_n_/&N;
drop x;
run;
Or the code from @PaigeMiller ?
Here is another solution, that does not use a macro.
I think there is an error in your program. This code
data period_5;
set period_5;
where period=5; /* keep only the CDF for period 5*/
cdf_t=_n_/&N;
drop x;
run;
will always set CDF_T=1/&N, in the data shown that would be 1/5=0.2.
Why? Because when you use a WHERE clause, only that observation is read, and that means that _N_ wil always be 1.
I I understand your thinking correctly, what you want is what you get if you replace that WHERE clause with an IF statement:
data period_5;
set period_5;
if period=5; /* keep only the CDF for period 5*/
cdf_t=_n_/&N;
drop x;
run;
In the actual example. this will give the same result, but that is only because the X value for PERIOD=5 (1) happens to be the lowest so far. If. for instance, the X value for PERIOD=5 had been 4, you would want a value of 0.4, not 0.2 (as 4 is now the second-smallest value in the PERIOD_5 table. Correct?
What you probably want is the reverse (meaning smallest first) rank of the X value divided by the number of observations having PERIOD<=[the current period].
So what you need is a table containing
in order to calculate the number you want. The first two are easy to calculate, the hard stuff is finding an efficient way to do the third calculation, which is the recursive part.
First, prepare the data:
%let last_classic=4; /* actual value of the last period you want done using the classical way */
proc sort data=have;
by period;
run;
data temp1;
set have;
n_period=_N_;
if period=&last_classic then
call symputx('n_last_classic',_N_);
run;
proc sort data=temp1;
by x;
run;
Then do the recursive part by using the OPEN function with a dynamic WHERE clause:
data want;
set temp1;
if n_period<=&n_last_classic then
cdf_t=N_period/&n_last_classic;
else do;
dsid=open(cats('temp1(where=(n_period<=',n_period,'))'),'I');
vnum_n_period=varnum(dsid,'n_period');
do cdf_t=1 by 1 until(n_period=getvarn(dsid,vnum_n_period));
rc=fetch(dsid);
end;
cdf_t=cdf_t/n_period;
rc=close(dsid);
end;
keep period cdf_t x;
run;
Finally, sort:
proc sort data=want;
by period;
run;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.