12-12-2016 04:32 PM
Hello SAS community!
I want to do a simple analysis of one dataset.
Let's say there are 2 variable A and B.
I want to know how many observations not missing variable A or B are in my dataset. Also, I want to know the number of observation missing value for A but not for B and vice versa.
What would be good a way to see this? Thank you very much
12-12-2016 04:49 PM
If the data is not really complicated or large I tend to start with something like:
Proc freq data=have;
tables a * b /list missing;
Note that the result could be sent to a data set using the OUT= option.
12-13-2016 05:05 PM
Thanks for the advice. However, the results were bit different from what I expected.
My dataset has more than 1 million observation and this program lists all list of observation under variable A and B.
What I want is something like this.
Count if A = missing
Count if B = missing
Count if A = missing & B =! missing
Count if B = missing & A =! missing
I would like to see one number for each of these 4 conditions something like 3,482 for the number of A having no observation
and 1,923 for observation missing A but having info about B.
I would appreciate if you can help me with this!
12-13-2016 05:22 PM
Did you try the example code.
Once you've assigned the formats and cross any variables you should get only Missing vs Not Missing.
Please post what you've tried.
12-13-2016 05:30 PM
Thanks Reeza for sharing the file!
I read the example 2, but I am not quite familiar with SAS, I was not sure how to apply this to my case.
To be bit more specific, two variables of interest are COMPANY_FKEY and BEST_EDGAR_TICKER. In the file you gave me,
the following example is given. The library I am using is cleaning.cleanrestat.
proc format; value $ missfmt ' '="Missing" other="Not Missing" ; value nmissfmt . ="Missing" other="Not Missing" ; run; proc freq data=home.ex2; tables _numeric_ _character_/missing; format _numeric_ nmissfmt. _character_ $missfmt.; run;
Once again, I am not quite familiar with SAS yet, I was trying to figure out which one is statement and which one is name of the variable.
If you can help me with this one, that will be very kind of you.
Another thing I noticed in the example is that the output only shows the number of observation missing and not missing only for each variable. I want to know the number of observation missing A but not B. It would be also nice if you can help me add that condition.
12-13-2016 07:55 PM
proc format; value $ missfmt ' '="Missing" other="Not Missing" ; value nmissfmt . ="Missing" other="Not Missing" ; run; proc freq data=home.ex2; tables VARIABLE1*VARIABLE2/missing; format _numeric_ nmissfmt. _character_ $missfmt.; run;
If you want the A*B cross tab, you can use an asterisk between the variables in the TABLE statement.
You can see the example above, where I assume VARIABLE1= <your variable name> and VARIABLE2 = <your variable name>
12-12-2016 05:09 PM
Assuming you create the formats along the lines mentioned in Reeza's reference, the complete program could be:
value $miss ' '='Missing' other='Present';
proc freq data=have;
tables a * b / list missing;
format a b $miss.;
12-13-2016 05:23 PM
Thanks for the response! I have a few additional questions.
The following is the result I got from running the program.
I ran the program on a sample dataset and I wasn't sure how to interpret this. It appears that there are 52 firm-year observations that does not have SIC_CODE_FKEY while they have COMPANY_FKEY. However, even when I changed the order of the variable, it did not show the other case which is missing COMPANY_FKEY and having SIC_CODE_FKEY.
Also, is it possible to know all 4 possible scenarios? 1) having both A & B 2) having only A 3) having only B 4) missing both A & B
and list the observation for each case?
Thank you for the help!
12-13-2016 06:31 PM
If it's only about two variables then some cross tabulation as already posted should do. If you want to list the missing/non missing combinations of n variables then you could also create a single new variable which contains a 0 for missing an a 1 for a value per source variable and then create a count of this single variable. See sample code below.
data have; set sashelp.cars (keep= Make Model EngineSize Cylinders); /* create some missing values */ if mod(ceil(ranuni(1)*20),5)=0 then call missing(Make); if mod(ceil(ranuni(1)*20),5)=0 then call missing(Model); if mod(ceil(ranuni(1)*20),5)=0 then call missing(EngineSize); if mod(ceil(ranuni(1)*20),5)=0 then call missing(Cylinders); run; data prep(keep=Miss0_NoMiss1); set have; length Miss0_NoMiss1 $ 4; Miss0_NoMiss1=cats(not missing(Make),not missing(Model),not missing(EngineSize),not missing(Cylinders)); run; proc freq data=prep; table Miss0_NoMiss1; run;