popopo17
Calcite | Level 5

Hi, everybody. I have a question about calculating a p-value for two different samples. 

So, I have two categories based on mortality (if the mortality rate is greater than 30%, coded as 1; otherwise coded as 2).

And based on this category, I have a sample size of 15 (N1=15) for Category 1, and a sample size of 18 (N2=18) for Category 2.

For each observation (county), I have the number of people who are poor (based on a certain income level) and the total population of that county. The number of poor people is the numerator and the total population is the denominator. 

I also can calculate the overall mean or average percentage of the poor based on the Category (1 or 2) I have described above. 

My question is:

  How can I write my SAS code to compare the overall mean or average of Category 1 vs. Category 2?? And is there a way to calculate the Variance or Standard Deviation (SD) for each Category? I would greatly appreciate your help!  

 

 

7 REPLIES
PaigeMiller
Diamond | Level 26

@popopo17 wrote:

 

My question is:

  How can I write my SAS code to compare the overall mean or average of Category 1 vs. Category 2?? And is there a way to calculate the Variance or Standard Deviation (SD) for each Category? I would greatly appreciate your help!   

 


All of the data you mentioned is categories ... so there's no such thing as a mean or average of category 1 vs category 2.

 

Explain in a lot more detail. Show us (a portion of) your data. State the exact hypothesis you wish to test, or the exact statistic you wish to compute.

--
Paige Miller
popopo17
Calcite | Level 5

I want to know if the average percentage of the population that is poor in Category 1 is different from that in Category 2. I have attached the file in Excel format. Thank you. 

PaigeMiller
Diamond | Level 26

Most people will not download Excel files from here (or other Microsoft Office documents), it is a security risk.

 

If you want to compare percents in two different groups, then do not call them "averages". They are percents.

 

Adding to what @Reeza said, he said consult a statistician ... and I am a statistician so here is my comment: your original problem statement says

For each observation (county), I have the number of people who are poor (based on a certain income level) and the total population of that county. The number of poor people is the numerator and the total population is the denominator.

 

There is no statistical test here. This is not a statistical question. If you have the information for two complete populations (which you just said you did), then there is no such thing as a statistical test here. I see you originally called these "samples", but they are not samples, they are the entire population. The only time you perform statistical tests is when the data is a sample of an entire population, not the entire population itself.

 

--
Paige Miller
popopo17
Calcite | Level 5

1) I am referring to the average of the percents within each category. So that is why I said that in the first place. 

 

2) So, what if the numbers were samples? Would I still not be able to compare the average of the percents for Category 1 vs. Category 2??

PaigeMiller
Diamond | Level 26

You have stated that the numbers are population numbers, so they are not samples.

 

There is no inferential statistical test to compare populations. This falls outside the realm of inferential statistics. A t-test is inferential statistics. This is a basic, fundamental first principle of statistics.

--
Paige Miller
Reeza
Super User

@PaigeMiller wrote:

Most people will not download Excel files from here (or other Microsoft Office documents), it is a security risk.

 

If you want to compare percents in two different groups, then do not call them "averages". They are percents.

 

Adding to what @Reeza said, he said consult a statistician ... and I am a statistician so here is my comment: your original problem statement says

For each observation (county), I have the number of people who are poor (based on a certain income level) and the total population of that county. The number of poor people is the numerator and the total population is the denominator.

 

There is no statistical test here. This is not a statistical question. If you have the information for two complete populations (which you just said you did), then there is no such thing as a statistical test here. I see you originally called these "samples", but they are not samples, they are the entire population. The only time you perform statistical tests is when the data is a sample of an entire population, not the entire population itself.

 


I'm going to disagree with @PaigeMiller, though he is the statistician so you should probably listen to him. 

 

I think you have a more complicated problem and you need to explain it more thoroughly. Although the data is representative of the populations, comparing the different populations is still a statistical test. The issue is that your 'measurements' are not exact: you most likely have an estimate of the incidence of poverty. Here's what I'd do to start: for each county, find your percentage estimate of poor with a confidence interval. Then plot the estimates with their intervals on a graph and look at that. You can also look at the combined data, but then you lose the impact of the different communities. I would suggest reviewing public health analyses, because this type of analysis (comparing rates among different populations) is common there. There's a reason they use age/sex-standardized rates before making comparisons, for example. 

 

Here's where I would start, but I would consider this exploratory analysis only and would not publish it. 

 

*create data from sample;

data have;
    input Poor Total_Pop Category;
    county=_n_;
    cards;
97799 338969 0 
68 235 1 
1755 5615 1 
16040 44880 0 
2495 8224 1 
1959 6056 0 
67553 258206 0 
2713 5852 0 
8082 39015 1 
111970 275455 1 
2518 7642 1 
10689 25812 0 
17410 50523 1 
1580 3833 1 
95898 253482 0 
15157 41597 1 
5480 13076 0 
1500 4800 1 
841262 2329364 0 
14201 42446 1
;
run;

*Expand the data to allow the use of PROC FREQ,
there are possible other ways but I couldn't figure it out
and this is just an example;

data expanded;
    set have;

    do i=1 to total_pop;

        if i<=poor then
            poor_category=0;
        else
            poor_category=1;
        output;
    end;
run;

*Calculate rates for each county with Confidence Limits;

proc freq data=expanded;
    by county;
    table poor_category/ out=summary1 binomial (ac wilson exact) alpha=0.05;
    ods output binomialCLS=estimates;
run;

*Limit summary to Wilson CLs;

data estimates2;
    set estimates;
    where type='Wilson';
run;

*Merge back in county information and rename;

data estimates3;
    merge estimates2 have (keep=county category);
    by county;
    rename category=mortality;
run;

*Sort data for display;

proc sort data=estimates3;
    by mortality;
run;

*Format for mortality;

proc format ;
    value mortality_fmt 0='Low Mortality' 1='High Mortality';
run;

*High low type graph for simplicity;

proc sgplot data=estimates3;
    highlow x=county high=upperCL low=LowerCL / close=proportion group=mortality 
        grouporder=data highlabel=county;
    xaxis display=none;
    format mortality mortality_fmt.;
    label mortality='Mortality';
run;

*Summary across all counties;

proc freq data=expanded;
    table category*poor_category / binomial riskdiff oddsratio;
run;
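
The expansion step above creates one row per person (several million rows for this data). A possible alternative, offered only as a sketch, is to keep two rows per county with a count and pass that count to PROC FREQ through the WEIGHT statement. The data set name counts and the variable n are made up for this illustration; the 0/1 coding of poor_category is copied from the expanded data set above, and the older-style BINOMIAL options are taken from the code above as-is.

```sas
/* Sketch: two rows per county (poor / not poor) with a count variable n, */
/* instead of one row per person. Assumes the data set have from above.   */
data counts;
    set have;
    poor_category=0; n=poor;           output;  /* 0 = poor, same coding as above */
    poor_category=1; n=total_pop-poor; output;  /* 1 = not poor */
    keep county category poor_category n;
run;

/* Per-county proportion poor with confidence limits. WEIGHT n tells */
/* PROC FREQ that each row stands for n people.                      */
proc freq data=counts;
    by county;
    weight n;
    tables poor_category / binomial(ac wilson exact) alpha=0.05;
    ods output BinomialCLs=estimates;
run;

/* Pooled 2x2 table across all counties. RELRISK reports the odds ratio */
/* and relative risks, RISKDIFF the difference in proportions.          */
proc freq data=counts;
    weight n;
    tables category*poor_category / riskdiff relrisk;
run;
```

The estimates data set should be usable in the later steps (Wilson filter, merge, plot) the same way as before, since only the input to PROC FREQ changes.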
Reeza
Super User

I don't think this is a t-test. A t-test would simplify the data too much and it's beyond that. 

Essentially you have a form of repeated measures for each category, but I also don't know how to deal with that, so all I'm going to say is you may want to consult a statistician here or make sure you're doing the correct test in some manner.
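
On the descriptive part of the original question (the mean, variance and SD of the county percentages within each category), a minimal sketch could look like the following. It assumes the data set have from the example code in this thread, where Category is coded 0/1 rather than 1/2; the names pct and pct_poor are invented for this illustration. It only summarizes the county-level percentages and is not a test of any hypothesis.

```sas
/* Sketch: percent poor for each county */
data pct;
    set have;
    pct_poor = 100 * poor / total_pop;
run;

/* Unweighted summary of the county percentages within each category. */
/* Every county counts equally here, whatever its population.         */
proc means data=pct n mean var std min max maxdec=2;
    class category;
    var pct_poor;
run;
```

Note that the mean of the county percentages is generally not the same number as the pooled percentage (total poor divided by total population in the category), because large and small counties get equal weight here.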

Discussion stats
  • 7 replies
  • 645 views
  • 2 likes
  • 3 in conversation