Hi everyone!
I have a dataset where column 1 holds stock returns, with each row representing a company. I need to calculate a new column 2 containing the empirical cumulative distribution function (ECDF) of column 1. I don't necessarily need the function itself; I just need the CDF value for each company. So basically, all the numbers in column 1 are supposed to follow this CDF, and each company gets a different value based on its column 1 return.
I hope I've made myself clear...
I googled it and found a SEVERITY procedure. Maybe my SAS installation is too old, but it says "Procedure SEVERITY not found." I wonder if there's a simple solution to my problem.
Many thanks!
I think that means just sorting the data and then dividing each row's position in the sorted order by n.
E.G.
Proc sort data=have; by column1; run;
data want;
set have;
ecdf=_n_/number_of_companies;
run;
If I'm way off, please post some more details.
PROC SEVERITY is part of the SAS/ETS package, and you may not have that licensed.
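A minimal sketch of the sort-and-divide idea above, assuming column1 has no missing values. The names sorted and n are my own; the NOBS= option supplies the row count, so number_of_companies does not have to be computed separately.

```sas
/* Sort into a copy so HAVE keeps its original order */
proc sort data=have out=sorted;
  by column1;
run;

data want;
  set sorted nobs=n;   /* n = number of rows in SORTED (temporary, not written to WANT) */
  ecdf = _n_ / n;      /* fraction of rows at or before this one in sorted order */
run;
```

Note that tied returns receive different values in this sketch; strictly, every tied row should get the largest of them.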
Hi Reeza,
Thanks for the reply. I think your way would imply that the returns follow a uniform distribution, since the difference between consecutive ECDF values is constant, which isn't really the case for my column 1.
My column 1 looks something like this:
1 5940285.422
2 15182646.036
3 34539.400618
4 4184974.0126
5 3824416.2707
........
That is the definition of the ECDF as far as I know, and according to Wikipedia:
http://en.wikipedia.org/wiki/Empirical_distribution_function
If you're looking for the distribution of the returns, you can sum the returns, still sort, and then divide each return by the total return.
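For reference, the definition on the linked page can be written as follows, where $x_1, \dots, x_n$ are the nonmissing column 1 values (the notation here is my own choice, not from the thread):

```latex
\hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_i \le t\}
```

Evaluated at the sorted data values, this rises by $1/n$ per observation (more at ties) whatever the distribution of the returns. The equal steps reflect equal weight per observation, not uniformly distributed data; the shape of the distribution shows up in how the values are spaced along the horizontal axis.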
This seems to make a lot of sense. Thank you!
BTW, Reeza, the ECDF is supposed to range from 0 to 1. But with the method above, my biggest value is smaller than 0.3, and the smallest is almost zero. It seems to me it needs some kind of standardization. Any ideas?
I think I'll divide the ecdf by the total return and then multiply by my biggest return number; that way it's standardized.
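I forgot about the cumulative part: you need to divide the running total by the total.
data want;
set have;
/* Start at 0: a RETAINed variable with no initial value starts missing, and missing + x is missing */
retain running_total 0;
running_total=running_total+column1;
/* overaltotal (the sum of column1) is assumed to already be a variable in HAVE */
ecdf=running_total/overaltotal;
run;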
This is what I did, and it finally worked:
data want;
set have;
retain running_total 0;
running_total+column1;
ecdf=running_total/overaltotal;
run;
---------------------------------------------------------------
Rick, you are right. I did first sort the data by column1, and I had checked that there was no missing data. So the full program is:
proc sort data=have;
by column1;
run;
data want;
set have;
retain running_total 0;
running_total+column1;
ecdf=running_total/overaltotal;
run;
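The program above relies on overaltotal already being a variable in the data. One way it might be computed, as a sketch only: the names sorted, totals, overall_total, and cum_share are hypothetical, and PROC MEANS is just one of several ways to get the sum.

```sas
/* Total of column1, written to a one-row data set */
proc means data=have noprint;
  var column1;
  output out=totals(keep=overall_total) sum=overall_total;
run;

proc sort data=have out=sorted;
  by column1;
run;

data want;
  if _n_ = 1 then set totals;   /* read the total once; SET variables are retained */
  set sorted;
  running_total + column1;      /* sum statement: starts at 0, skips missing values */
  cum_share = running_total / overall_total;
run;
```

Reading the total from a data set rather than a macro variable avoids rounding the sum when it is converted to text.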
Your program computes the cumulative proportion of total values. If that's what you want, then fine. But the cumulative proportion is not equivalent to the ECDF unless the original data are sorted and nonmissing. So make sure you remember to sort and remove missings.
PROC SEVERITY is in the SAS/ETS product, but you can use PROC UNIVARIATE to get this automatically.
proc univariate data=sashelp.class noprint;
var weight;
cdfplot weight;
ods output CDFPlot=ECDF;
run;
The data set ECDF contains two columns, ECDFX and ECDFY, that contain the empirical CDF.
By the way, the DATA step code works provided that all the data are nonmissing, but it should be adjusted to handle missing values.
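If SAS/ETS and ODS Graphics are both unavailable, PROC RANK in Base SAS may be another route. This is a sketch, not something suggested in the thread: with the FRACTION option, each rank is divided by the number of nonmissing values, and with TIES=HIGH tied values all receive the highest rank, so the result behaves like the ECDF evaluated at each observation. The output names want and ecdf are my own.

```sas
/* No sorting needed; rows keep their original order */
proc rank data=have out=want fraction ties=high;
  var column1;
  ranks ecdf;    /* ecdf = (number of nonmissing values <= column1) / n */
run;
```

Rows with a missing column1 get a missing ecdf and do not count toward n, which addresses the missing-value concern without a separate cleaning step.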
Rick, is this a 9.3 option? I'm trying it in 9.2 with a CLASS variable, but I can't seem to figure out what the table name is.
Output Added:
-------------
Name: UNIVAR12
Label: CDF Plot 1
Data Name: GRSEG
Path: Univariate.time_to_pay.CDFPlot.UNIVAR12
-------------
NOTE: PROCEDURE UNIVARIATE used (Total process time):
real time 2.70 seconds
cpu time 0.71 seconds
Output Added:
-------------
Name: UNIVAR13
Label: CDF Plot 1
Data Name: GRSEG
Path: Univariate.time_to_pay.CDFPlot.UNIVAR13
-------------
WARNING: Output 'CDFPlot' was not created. Make sure that the output
object name, label, or path is spelled correctly. Also, verify
that the appropriate procedure options are used to produce the
requested output object. For example, verify that the NOPRINT
option is not used.
It looks like you don't have ODS graphics turned on?
ODS graphics on;
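Putting the two replies together, the full sequence would presumably look like this. The PROC PRINT step is my addition, only to inspect the result.

```sas
ods graphics on;   /* per the reply above, needed before the CDFPlot object can be captured */

proc univariate data=sashelp.class noprint;
  var weight;
  cdfplot weight;
  ods output CDFPlot=ECDF;   /* save the plotted points as a data set */
run;

/* Look at the first few rows of the ECDFX and ECDFY columns */
proc print data=ECDF(obs=5);
run;
```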
Yup, that was it, thanks