- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi
I have a few questions on how to calculate a mean/quartile:
1. I have a data set that has two variables that I want to calculate the mean and other quartiles for. The below statement returns what appears to be the means and the first and third quartiles. From a code perspective, is this correct?
proc means data=s10sorted median q1 q3 stddev;
var dur_filing_issuejoin dur_issuejoin_dispo;
run;
2. In the columns for the above variables, I have many null values that I want to factor into the median/quart. In other words, I do not want to calculate a mean or quartile if 50% (or 75% for 3rd quartile) of the observations have not occurred yet. I am assuming the above procedure just calculates the mean/quart for the numbers that are there, so it doesn't care if the appropriate number of obs have occurred?
3. I need to export this to a spreadsheet and am unfamiliar with how to do this. I know proc print prints out to screen. Is there another way to generate a .csv, .txt, or .xls?
Paul
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Editor's Note: Thanks to @Reeza for providing the answer that addressed the original question and for pointing to an alternative using PROC LIFETEST. Most Base procedures, including PROC MEANS and PROC RANK, will exclude missing values before calculating statistics. Changing missing values to 0 in advance would not help in this case. If you don't want a statistic to be calculated if there are not enough non-missing values, you would need to check in advance for the number of non-missing values and then use macro logic to conditionally execute the code based on that number.
1. This is correct, but why would you calculate stddev but quartiles? Stdev is usually used with the mean?
2. Does this mean you want to exclude your null values? If they're coded as missing, SAS will exclude them, if you're looking at doing something else, like censoring you might want to consider proc lifetest instead but it depends on your data and your question.
3.
ods tagsets.excelxp file='Temp.xml';
proc means data=s10sorted median q1 q3 stddev;
var dur_filing_issuejoin dur_issuejoin_dispo;
run;
ods tagsets.excelxp close;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Editor's Note: Thanks to @Reeza for providing the answer that addressed the original question and for pointing to an alternative using PROC LIFETEST. Most Base procedures, including PROC MEANS and PROC RANK, will exclude missing values before calculating statistics. Changing missing values to 0 in advance would not help in this case. If you don't want a statistic to be calculated if there are not enough non-missing values, you would need to check in advance for the number of non-missing values and then use macro logic to conditionally execute the code based on that number.
1. This is correct, but why would you calculate stddev but quartiles? Stdev is usually used with the mean?
2. Does this mean you want to exclude your null values? If they're coded as missing, SAS will exclude them, if you're looking at doing something else, like censoring you might want to consider proc lifetest instead but it depends on your data and your question.
3.
ods tagsets.excelxp file='Temp.xml';
proc means data=s10sorted median q1 q3 stddev;
var dur_filing_issuejoin dur_issuejoin_dispo;
run;
ods tagsets.excelxp close;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks Reeza.
1. I was experimenting with proc means, so stdev was just part of that.
2. No, I want the null values to be factored in. What I want to do is to be sure the median and quartile calculations are done when 50% or 75% of the observations have actually occurred. I don't want to calculate the either based on SAS just calculating it for the values that are present.
So if I have 10 potential observations and only 4 of them have occurred as of a certain date, I should not be able to calculate a median for that variable. The way proc means seems to work is that it will calculate a median for the 4 present values and exclude the 6 that are missing.
Now if there were 5 observations (out of 10), I should be able to calculate a median because half of the observations have occurred and it doesn't matter what the remaining values are, they are already past the median point.
So I am wondering if this is something that must be coded or whether there is a configuration or another procedure that accomplishes this?
3. Thank you. I will try this.
Paul
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I think you need to look into survival analysis and the Kaplan Meier Curve and see if its what you're looking for.
The related SAS procedure is proc lifetest, if you provide a bit about the subject matter I can help direct you to a relevant resource.
There are slight difference in terms when looking at churn models for sales, survival of patients and failure of products in a manufacturing industry.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Yes, can you please point me to the lifetest resource?
Also, on the function to export results of this into a file, the below code is part of a much larger set of code. Is there parameter that just points to a specific PROC or DATA section of code and not the whole thing?
Paul
ods tagsets.excelxp file='Temp.xml';
proc means data=s10sorted median q1 q3 stddev;
var dur_filing_issuejoin dur_issuejoin_dispo;
run;
ods tagsets.excelxp close;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Survival/Failure Time Analysis
Introduction to Survival Analysis with SAS Seminar
You can create an output data set from proc means and then export that out instead using proc means.
Here's a paper on that. If you have 9.3 you may want to look into the stackODS option as well.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Or Maybe you can check PROC RANKS , which might be your demand .
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
PROC RANK with the percent option, and then use if logic to identify the quartiles in a data step.
Ex.
Proc rank data=data1 out=data2 percent;
var quartile-variable;
ranks percentile;
run;
data data3;
set data2;
if percentile < .25 then quartile=1;
else if percentile < .5 then quartile = 2;
else if percentile < .75 then quartile = 3;
else quartile=4;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Proc rank will not necessarily work because the percent option only calculates percents for non-missing values: "Percent: divides each rank by the number of observations that have nonmissing values of the variable and multiplies the result by 100 to get a percentage...."
While this makes sense, I need to factor in missing values in determining a median.
Paul
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Ok, I see what you're saying. Does it matter where those missing ones are ranked? Could you set all missing values to the minimum value so that they are all put into the lowest quartile? Then at the 25%, 50%, etc. you will be at a point where you have that same % of the observations... just skewed up above their actual relative position when only compared to other non-missing.