## Stats question related to SAS

Super Contributor
Posts: 418

Hello everyone. I have been tasked with finding out whether missing data in variable A is related to the groups (observations) in variable B. The belief is that different groups in variable B (say, state) differ in how often variable A (say, debt to income) goes unreported.

I was thinking there are two ways of doing this (both could be wrong, based upon my 1 semester of intro to statistics).

First, I could create a frequency table of states across the entire dataset, and then do this again using only the observations where variable A is missing. Then I could look for significant differences (perhaps I would compute the 5% and 95% confidence limits of each group's share of the entire dataset, and see whether that group's share of the missing observations falls within that band?). My first question: is there a procedure in SAS that gives the lower 5% and upper 95% confidence limits of an observation percentage by group? Something like the following...

| State | % of data | Lower 5% | Upper 95% |
|-------|-----------|----------|-----------|
| CA    | 11%       | 10.54%   | 11.46%    |
| CO    | 8%        | 6.58%    | 9.42%     |
| ...   |           |          |           |

etc.
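
A table of this shape can be sketched in SAS roughly as follows. This is only an illustration under assumptions: the dataset `work.loans` and the variables `state` and `dti` are hypothetical names, not taken from the post. The `CL` option of PROC SURVEYFREQ prints confidence limits for each level's percentage, and a PROC FREQ chi-square test checks directly whether missingness depends on state, which may be closer to the actual goal than comparing bands by eye.

```sas
/* Hypothetical input: work.loans with a variable STATE and a numeric
   variable DTI (debt to income). These names are assumptions. */
data work.loans_flag;
    set work.loans;
    dti_missing = missing(dti);   /* 1 if DTI is missing, 0 otherwise */
run;

/* Percent of rows per state with confidence limits.
   With no design statements, SURVEYFREQ treats the data as a simple
   random sample. ALPHA=0.10 requests 90% limits (5% in each tail);
   drop it to get the default 95% limits. */
proc surveyfreq data=work.loans_flag;
    tables state / cl alpha=0.10;
run;

/* Chi-square test of independence between state and missingness */
proc freq data=work.loans_flag;
    tables state*dti_missing / chisq nocol nopercent;
run;
```

Running the SURVEYFREQ step a second time with `where dti_missing = 1;` would give the state mix among the missing rows for comparison.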

If this is not a logical or correct way of doing this analysis, please let me know and move on to my next idea (this is the one I'm more inclined to use).

The other idea is to convert variable A into a 0/1 indicator (1 for missing, 0 for not missing) and then run a logistic regression of that indicator on variable B. Any groups with a significant coefficient would then be the groups that affect the missing data.

My question is: doesn't logistic regression have to assume that the dependent and independent variables are linearly correlated? In my example I have an outcome percentage and a categorical variable, so how would that even make sense?

Also, I know there is an option that needs to be specified for logistic regression to get the coefficients in the correct order, plus allow it to know your independent variable is categorical. Could anyone post some code of a logistic regression they have done similar to this, or a link to a good paper on it?
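
A minimal sketch of such a model, assuming a dataset `work.loans_flag` that already holds a 0/1 indicator `dti_missing` and a variable `state` (all hypothetical names). The `CLASS` statement tells PROC LOGISTIC that `state` is categorical, and `EVENT='1'` makes the procedure model the probability of the value 1 (missing) rather than 0, which is likely the ordering option mentioned above. Because each state gets its own coefficient, no linear relationship between state and the outcome is assumed; the linearity assumption applies to continuous predictors on the log-odds scale.

```sas
/* Assumes work.loans_flag has DTI_MISSING (0/1) and STATE;
   both names are hypothetical. */
proc logistic data=work.loans_flag;
    /* Reference coding: each state is compared with CA.
       Choosing CA as the reference level is an arbitrary assumption. */
    class state (ref='CA') / param=ref;
    /* Model the probability that DTI is missing */
    model dti_missing(event='1') = state;
run;
```

With `PARAM=REF`, each reported coefficient and p-value describes one state relative to the reference state, so the choice of reference level changes which comparisons appear.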

Does either of these methods seem appropriate (do they even do what I'm trying to do?), and if so, can anyone point out some of the assumptions I would need to test to use either method?

Thanks so much for your time. If you need further information from me, please let me know!
Brandon

P.S. I know this is about 90% stats and 10% SAS, but this forum has always been extremely helpful, so I figured I could ask here and maybe still get help for myself and others.

Super Contributor
Posts: 418

## Re: Stats question related to SAS

On the more SAS-related side, how do I get an output dataset that contains the p-values and the associated estimated odds ratios from a SAS logistic regression?

I want a dataset or some kind of output that lets me tell whether a logistic regression analysis was successful (i.e., the Hosmer and Lemeshow goodness-of-fit p-value is greater than alpha). Then I'd also like to know which of the groups within the dataset had a distinct response that is statistically significant.

For example, attached is an output at the state level. I'd like to pick up the states whose Pr > ChiSq is below 0.05 (final column). I am not sure if this can be output from PROC LOGISTIC.
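
One possible way to capture these tables is the `ODS OUTPUT` statement, sketched below. The input dataset `work.loans_flag`, its variables `dti_missing` and `state`, and the output dataset names are all hypothetical. `ParameterEstimates`, `OddsRatios`, and `LackFitChiSq` are the ODS table names PROC LOGISTIC uses for the coefficient table, the odds ratio table, and the Hosmer and Lemeshow test (the last one requires the `LACKFIT` option).

```sas
/* Assumes work.loans_flag has DTI_MISSING (0/1) and STATE;
   all dataset and variable names here are hypothetical. */
proc logistic data=work.loans_flag;
    class state (ref='CA') / param=ref;
    model dti_missing(event='1') = state / lackfit;
    /* Save the printed tables as datasets */
    ods output ParameterEstimates=work.pe
               OddsRatios=work.odds
               LackFitChiSq=work.hl;
run;

/* Keep only the state coefficients with Pr > ChiSq below 0.05.
   ClassVal0 in this dataset holds the state each row refers to. */
data work.sig_states;
    set work.pe;
    where upcase(Variable) = 'STATE' and ProbChiSq < 0.05;
run;
```

Two caveats apply to this sketch. Each p-value compares a state with the reference state, not with the overall average. Also, with a single `CLASS` predictor the model simply reproduces each state's observed missing rate, so the Hosmer and Lemeshow test says little until continuous predictors are added.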

Thanks for the help!

Super User
Posts: 23,683

## Re: Stats question related to SAS

That's a lot of questions in one!

First off, I think one of your questions is trying to identify whether a variable is missing at random or missing systematically. For example, are people with higher incomes less likely to report, or are minorities less likely to report? Is this correct?

As for your second question, on PROC LOGISTIC, I'd post that as a new question.

Super Contributor
Posts: 418

## Re: Stats question related to SAS

Haha yes it is Reeza.

And you are correct. I'm trying to figure out if a missing variable is missing at random across data or systematically related to other variables within the dataset. You hit the nail on the head.

Super Contributor
Posts: 418

## Re: Stats question related to SAS

Does either of my options above make sense for what I am trying to do, given your understanding, Reeza?

Thanks!
Brandon

Super User
Posts: 23,683

## Re: Stats question related to SAS

Brandon,

I'd go with the logistic regression, because there could be more than one factor that affects missingness. For example, males with higher incomes may be more likely to report while females with higher incomes are less likely. The extremely wealthy may be far less likely to report regardless of age. Some of this, though, goes back to knowing your data and understanding how it was collected rather than statistical analysis.
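
As a rough sketch of what a multi-factor model could look like: the dataset `work.loans_flag` and the variables `dti_missing`, `state`, `sex`, and `income` are hypothetical names chosen for illustration. The `sex*income` term lets the effect of income on missingness differ between males and females, which is the kind of pattern described above.

```sas
/* Assumes work.loans_flag has DTI_MISSING (0/1), STATE, SEX, and a
   numeric INCOME; all of these names are hypothetical. */
proc logistic data=work.loans_flag;
    class state sex / param=ref;
    /* sex*income allows the income slope to differ by sex */
    model dti_missing(event='1') = state sex income sex*income;
run;
```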

I'd highly suggest googling the topic with the search term "MISSING AT RANDOM" and specifically checking what Statistics Canada, the Census Bureau, or whichever statistical agency is in your country says about it.

Canada has had significant issues with this recently as our census became "voluntary" rather than mandatory.
