I'm trying to do a secondary analysis on a very large dataset that has a lot of missing values. I'd like to see if there is any pattern within the missing values. I'd like to create a new dummy variable for missing so 0=missing and 1=not missing. I'm not sure how to go about this. I've created a smaller dataset that narrow down my predictor and outcome variables of interest. All of the variables are numeric/continuous and each variable has missing values. Any suggestions are greatly appreciated!
do you want as simple as this?
data have;
do i=1 to 100;
num=int(ranuni(76567)*100);
if mod(i,5)=0 then call missing(num);
output;
end;
run;
data want;
set have;
dummy_num=num ne .;
run;
Thank you so much! Can you just explain what this step does
num=int(ranuni(76567)*100);
if mod(i,5)=0 then call missing(num);
ignore the data have step. I wanted to create some sample data with random numbers as you didn't provide a sample
data want is what you need
For your interest,
Using ranuni function I created 100 random numbers between 0 and 100. For every 5th or in other words for every multiple of 5th value, i assigned them as missing. This is what data have does,
Data want is a simple boolean expression i.e if num is missing then true else false in the form of 1s and 0s
Hi @brookedanel,
If all you need are numbers and percentages of missing values for each variable or similar results for combinations of variables, you don't need dummy variables. Just use a format.
Example:
proc format;
value miss
._-.z = 'missing'
other = 'non-missing';
value $miss
' ' = 'missing'
other = 'non-missing';
run;
proc freq data=yourdata;
format _numeric_ miss. _character_ $miss.;
tables _all_ / missing; /* or replace _all_ with another variable list */
run;
Edit: The output (here for SASHELP.HEART) will look something like this:
The FREQ Procedure
Cumulative Cumulative
Status Frequency Percent Frequency Percent
----------------------------------------------------------------
non-missing 5209 100.00 5209 100.00
Cause of Death
Cumulative Cumulative
DeathCause Frequency Percent Frequency Percent
----------------------------------------------------------------
missing 3218 61.78 3218 61.78
non-missing 1991 38.22 5209 100.00
Age CHD Diagnosed
Cumulative Cumulative
AgeCHDdiag Frequency Percent Frequency Percent
----------------------------------------------------------------
missing 3760 72.18 3760 72.18
non-missing 1449 27.82 5209 100.00
... portions of the output are not shown ...
Edit 2: Here is an example using a combination of variables:
proc freq data=sashelp.heart;
format _numeric_ miss. _character_ $miss.;
tables status*deathcause*agechddiag / missing list;
run;
Output:
Cumulative Cumulative Status DeathCause AgeCHDdiag Frequency Percent Frequency Percent ---------------------------------------------------------------------------------------------- non-missing missing missing 2663 51.12 2663 51.12 non-missing missing non-missing 555 10.65 3218 61.78 non-missing non-missing missing 1097 21.06 4315 82.84 non-missing non-missing non-missing 894 17.16 5209 100.00
Thank you so much @FreelanceReinh. I'm going to give this a shot. It's similar to the code I was using but I see now where I missed something. I appreciate the quick reply.
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.