I'm trying to do a secondary analysis on a very large dataset that has a lot of missing values. I'd like to see if there is any pattern within the missing values. I'd like to create a new dummy variable for missing so 0=missing and 1=not missing. I'm not sure how to go about this. I've created a smaller dataset that narrow down my predictor and outcome variables of interest. All of the variables are numeric/continuous and each variable has missing values. Any suggestions are greatly appreciated!
do you want as simple as this?
data have;
do i=1 to 100;
num=int(ranuni(76567)*100);
if mod(i,5)=0 then call missing(num);
output;
end;
run;
data want;
set have;
dummy_num=num ne .;
run;
Thank you so much! Can you just explain what this step does
num=int(ranuni(76567)*100);
if mod(i,5)=0 then call missing(num);
ignore the data have step. I wanted to create some sample data with random numbers as you didn't provide a sample
data want is what you need
For your interest,
Using ranuni function I created 100 random numbers between 0 and 100. For every 5th or in other words for every multiple of 5th value, i assigned them as missing. This is what data have does,
Data want is a simple boolean expression i.e if num is missing then true else false in the form of 1s and 0s
Hi @brookedanel,
If all you need are numbers and percentages of missing values for each variable or similar results for combinations of variables, you don't need dummy variables. Just use a format.
Example:
proc format;
value miss
._-.z = 'missing'
other = 'non-missing';
value $miss
' ' = 'missing'
other = 'non-missing';
run;
proc freq data=yourdata;
format _numeric_ miss. _character_ $miss.;
tables _all_ / missing; /* or replace _all_ with another variable list */
run;
Edit: The output (here for SASHELP.HEART) will look something like this:
The FREQ Procedure Cumulative Cumulative Status Frequency Percent Frequency Percent ---------------------------------------------------------------- non-missing 5209 100.00 5209 100.00 Cause of Death Cumulative Cumulative DeathCause Frequency Percent Frequency Percent ---------------------------------------------------------------- missing 3218 61.78 3218 61.78 non-missing 1991 38.22 5209 100.00 Age CHD Diagnosed Cumulative Cumulative AgeCHDdiag Frequency Percent Frequency Percent ---------------------------------------------------------------- missing 3760 72.18 3760 72.18 non-missing 1449 27.82 5209 100.00 ... portions of the output are not shown ...
Edit 2: Here is an example using a combination of variables:
proc freq data=sashelp.heart;
format _numeric_ miss. _character_ $miss.;
tables status*deathcause*agechddiag / missing list;
run;
Output:
Cumulative Cumulative Status DeathCause AgeCHDdiag Frequency Percent Frequency Percent ---------------------------------------------------------------------------------------------- non-missing missing missing 2663 51.12 2663 51.12 non-missing missing non-missing 555 10.65 3218 61.78 non-missing non-missing missing 1097 21.06 4315 82.84 non-missing non-missing non-missing 894 17.16 5209 100.00
Thank you so much @FreelanceReinh. I'm going to give this a shot. It's similar to the code I was using but I see now where I missed something. I appreciate the quick reply.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.