Hi Everyone,
I have a time series data and I would like to perform the following rank:
for each row, look back 199 rows (so total 200 rows).
Rank value within 200 rows (so the rank will be 1 to 200)
Identify the top 150,180, 190, 196 values.
I use the percentile concept but it I think I did something wrong.
For illustration purpose, I create a list of value of 1 to 200.
In the first code, when data is sorted in ascending order, the code identifies correctly these 150,180, 190, 196 values for the last row.
However, when I switch position for number 1 and 200, the code Incorrectly identify the percentile value (151 181 191 197) for the last row.
My point is that the change I make is just the order and therefore at the last row, the code still look at the full 200 values as before, and it should return the same percentile value.
Can anyone please help me to fix my code?
Thank you,
HHC
*NUMBERS ARE IN ORDER --> GOOD;
data have;
do value=1 to 200;
output;
end;
run;
data want; set have;
drop aa1-aa200;
array a{0:199} aa1-aa200;
retain aa:;
pct_low=pctl(75,of a{*});
pct_mid=pctl(90,of a{*});
pct_high=pctl(95,of a{*});
pct_top=pctl(98,of a{*});
a{mod(_n_,200)}=value;
run;
*SWICH POSITION OF NUMBER 1 and 200 --> NOT GOOD;
data have2; set have;
if _N_=200 then value=1;
if _N_=1 then value=200;run;
data want2; set have2;
drop aa1-aa200;
array a{0:199} aa1-aa200;
retain aa:;
pct_low=pctl(75,of a{*});
pct_mid=pctl(90,of a{*});
pct_high=pctl(95,of a{*});
pct_top=pctl(98,of a{*});
a{mod(_n_,200)}=value;
run;
If you only moved the assignment statement in the DATA WANT2 step but not the DATA WANT step, then you should not expect the same results. Because the DATA WANT2 step is evaluating 200 obs (200, then 2 through 199, then 1), while the DATA WANT step is still only evaluating 199 values (1 ascending to 199).
In short, data order IS NOT affecting the calculation of percentiles.
The problem is that you are calculating the percentiles PRIOR to assigning value to an element of array a. Therefore the last observation of the first instance is generating percentiles based on 199 values ranging from 1 to 199. It ignores the value=200.
But the second instance is generating percentiles based on 199 values of 200, followed by 2 through 199. It ignores the 1.
Why are you assigning values to the array elements prior to percentile calclulation?
So I move the array definition above the pct and the result are the same now.
However, the numbers are 150.5 180.5 190.5 and 196.5, still not the exact value yet of (151 181 191 197)
data want2; set have2;
*drop aa1-aa200;
array a{0:199} aa1-aa200;
a{mod(_n_,200)}=value;
retain aa:;
pct_low=pctl(75,of a{*});
pct_mid=pctl(90,of a{*});
pct_high=pctl(95,of a{*});
pct_top=pctl(98,of a{*});
run;
If you only moved the assignment statement in the DATA WANT2 step but not the DATA WANT step, then you should not expect the same results. Because the DATA WANT2 step is evaluating 200 obs (200, then 2 through 199, then 1), while the DATA WANT step is still only evaluating 199 values (1 ascending to 199).
In short, data order IS NOT affecting the calculation of percentiles.
This gives the same result for both tables:
data WANT;
set HAVE end=LASTOBS;
array A{200} _temporary_;
A[_N_]=VALUE;
if LASTOBS;
PCT_75=pctl(75,of A{*});
PCT_90=pctl(90,of A{*});
PCT_95=pctl(95,of A{*});
PCT_98=pctl(98,of A{*});
run;
It is also much more efficient.
Please do not use tabs, the code you pasted is not formatted properly.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.