Suppose I have the following data:
data temp;
infile DATALINES dsd missover;
input ID Miles Windows Occupants;
CARDS;
01, 100, 2, 5
02, 200, 4, 5
03, 300, 2, .
04, ., 6, 2
05, 500, ., 8
06, 600, ., .
07, 700, 4, 5
08, 800, 4, .
02, 4, 1.7, 3
02, 5, 5.1, 4
;
run;
Is there a way for me to check whether any set of variables, excluding another set of variables is missing, and then set that value =0?
For instance, I want to check if Windows and Occupants are missing, but not Miles. I know how to do this individually, but I have around 50 or so variables I want to check, and 5 I want to exclude, so it would be somewhat tedious to write it all out.
My output would be:
data temp;
infile DATALINES dsd missover;
input ID Miles Windows Occupants;
CARDS;
01, 100, 2, 5
02, 200, 4, 5
03, 300, 2, 0
04, ., 6, 2
05, 500, 0, 8
06, 600, 0, 0
07, 700, 4, 5
08, 800, 4, 0
02, 4, 1.7, 3
02, 5, 5.1, 4
;
run;
You can do this by adding a statement in the do loop:
proc sql noprint;
select name
into:mylist separated by " "
from dictionary.columns
where libname="WORK" and memname="TEMP" and name not in ("ID" "Miles"); /* -> select your columns*/
quit;
data want;
set temp;
array _miss (*) &mylist;
do i=1 to dim(_miss);
if _miss(i) = . then _miss(i) = 0;
_miss(i) = _miss(i) / /*your_variable*/;
end;
drop i;
run;
In fact, the array is just a way to repeat the same manipulation on a series of columns.
In practice, the column can be identify by its position in the array:
e.g. _miss(1) corresponds to the first column referenced in the array, so Windows
e.g. _miss(2) corresponds to the second column referenced in the array, so Occupants
...
So instead of writing
If Windows = . then Windows =0; Windows = Windows / x; If Occupants = . then Occupants =0; Occupant = Occupants / x;
You can write:
If _miss(1) = . then _miss(1) =0; _miss(1) = _miss(1) / x; If _miss(2) = . then _miss(2) =0; _miss(2) = _miss(2) / x;
... and what is better -> loop through the position 1, 2, ...
Best,
You can use an array to achieve this:
data want;
set temp;
array _miss (*) Windows Occupants /* NB: reference the variables that are concerned or use _numeric_ if you want to refer to all numeric variables*/;
do i=1 to dim(_miss);
if _miss(i) = . then _miss(i) = 0;
end;
drop i;
run;
Best,
You're welcome!
I suggest that you use a macrovariable to retrieve the column name that you want to include in the array statement.
You just need to adapt the "where" clause in Proc SQL to specify which variables you don't want.
proc sql noprint;
select name
into:mylist separated by " "
from dictionary.columns
where libname="WORK" and memname="TEMP" and name not in ("ID" "Miles"); /* -> select your columns*/
quit;
data want;
set temp;
array _miss (*) &mylist;
do i=1 to dim(_miss);
if _miss(i) = . then _miss(i) = 0;
end;
drop i;
run;
Best,
I think this is along the lines of what I want. Although, when I run the proc sql with my actual data, I get "No rows were selected". Any ideas why? I tried the test data and it works.
Edit: I solved it, the memname has to be in uppercase! Thanks!!
Awesome!
Quick question, if I also wanted to divide this list of variables by their corresponding row entry for another variable, e.g., Miles, can I somehow do it in that data-step?
Edit: Got it, in the loop I can just put _miss(i)/miles
thanks!
@UniversitySas wrote:
Quick question, if I also wanted to divide this list of variables by their corresponding row entry for another variable, e.g., Miles, can I somehow do it in that data-step?
Quick answer ... don't do this. If you really need separate analyses for the different values of another variable ... there are better ways to do this such as using formats or using BY variables.
You can do this by adding a statement in the do loop:
proc sql noprint;
select name
into:mylist separated by " "
from dictionary.columns
where libname="WORK" and memname="TEMP" and name not in ("ID" "Miles"); /* -> select your columns*/
quit;
data want;
set temp;
array _miss (*) &mylist;
do i=1 to dim(_miss);
if _miss(i) = . then _miss(i) = 0;
_miss(i) = _miss(i) / /*your_variable*/;
end;
drop i;
run;
In fact, the array is just a way to repeat the same manipulation on a series of columns.
In practice, the column can be identify by its position in the array:
e.g. _miss(1) corresponds to the first column referenced in the array, so Windows
e.g. _miss(2) corresponds to the second column referenced in the array, so Occupants
...
So instead of writing
If Windows = . then Windows =0; Windows = Windows / x; If Occupants = . then Occupants =0; Occupant = Occupants / x;
You can write:
If _miss(1) = . then _miss(1) =0; _miss(1) = _miss(1) / x; If _miss(2) = . then _miss(2) =0; _miss(2) = _miss(2) / x;
... and what is better -> loop through the position 1, 2, ...
Best,
Alternatively, consider the do over
data want;
set temp;
array mis Windows Occupants;
do over mis;
if mis=. then mis=0;
end;
run;
proc stdize data=temp out=want reponly missing=0;
var Windows Occupants;
run;
Setting values to zero when they are missing seems like a dangerous thing to do, it will change your averages and every other statistic as well. Does it really make any sense to say if Occupants is missing, that's really a zero? Such a modification of the data should only be done if you have a good understanding of each variable and why it might be missing, and without such reasoning, I would not recommend doing this.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.