## How to run means on many variables upon their conditions (subset selection)?

Frequent Contributor
Posts: 75

# How to run means on many variables upon their conditions (subset selection)?

I want to run proc means on several continuous variables, for example Var1, Var2, Var3, but want to leave out certain observations because their values are not appropriate for analysis. For example, some observations/respondents may receive code of 99999999 on any, some, or all of these variables if those respondents were not asked such questions on the survey, like men generally aren't asked how much they spend each month on lotions (well, some surveys may ask that question and I've got no problem with that, but just to say).

It can also be the case where there are more than one values that we wish to leave out. For example, some observations receive values of 88888888 if they answered "Don't know". So these are not missing values as they did answer, but at the same time not legitimate for calculating (means) either.

If it's one variable, e.g., Var1, then I can do:

proc means data=have;

where Var1 not in (88888888, 99999999);

var Var1;

run;

But I'm not sure how to make it efficient in the case of multiple variables. I think of a two-step solution but then figure out it won't work, like:

data have; set have;

Var4=0;

array myarr

Var1 Var2 Var3;

do i = 1 to 3;

if myarr(i) in (88888888, 99999999) then Var4=1;

end; drop i; run;

proc means data=have;

where Var4 ne 1;

var Var1 Var2 Var3;

run;

It won't work because the array assigns 1 to Var4 if any from Var1 - Var3 receives one of these values, thus conditioning on Var4 inappropriately leaves out legitimate observations of variables.

Super User
Posts: 13,365

## Re: How to run means on many variables upon their conditions (subset selection)?

How many variables? How many combinations of values to exclude. If the values you are excluding are in general invalid or a code to indicate specific forms of missing data it may be easiest in the long run to create new variables where those values are set to missing.

Retaining the original variable allows a double check on any raw values. OR delve into to the world of missing indicated as .A, .B and such and custom formats to display those as "Not asked" "Don't Know" or similar.

data want; /* I am very leery of using the data have; set have; when recoding variables as an error may go undetected and you may not be able to determine HOW that error occurred later*/

set have;

array myarr Var1 Var2 Var3;

array mynewvar NewVar1 NewVar2 NewVar3;

do i = 1 to dim (myvar);

if myarr(i) in (88888888, 99999999) then MyNewVar(I) = . ;

else MyNewVar(I) = myarr(I);

end;

drop i;

run;

Group the variables of similar coding together and assign missing.

Then proc means, summary and surveymeans, surveyfreq will give you pretty much intended results when asking for the statistics on the NewVar variables.

The other approach involves creating custom informats and formats for use when reading.

Here is a brief example:

proc format library=work;
invalue xmiss
888888 = .A
999999 = .B
other = _same_;
;
value xmiss
.A= "Don't Know"
;

run;

data junk;
informat x xmiss.;
input x;
format x xmiss.;
datalines ;
888888
1
2
3
4
999999
;
run;

proc means data=junk;
var x;
format x xmiss.;
run;

Super User
Posts: 23,365

## Re: How to run means on many variables upon their conditions (subset selection)?

The most efficient method is to clean your data first overall, I'd also be cautious about using your where clause like that in your Proc Means.

Depending on how you plan to do your analysis in the future saying (where Var1 not in (88888888, 99999999) then excludes that observation for all other variables. So if a male hasn't been asked how much he's spent on cosmetics that would also exclude varX which might be how much did someone spend on takeout.

PS I have some male friends, straight and gay, that spend more on cosmetics than I ever do...

Super User
Posts: 6,648

## Re: How to run means on many variables upon their conditions (subset selection)?

Borrowing ideas from both of the responses you have gotten so far ...

It was a mistake to store values of 88888888 and 99999999.  You are better off fixing that mistake first rather than trying to code around it.  The fix is relatively straightforward:

data want;

set have;

array q {500} var1-var500;

do _i_=1 to 500;

if q{_i_} = 88888888 then q{_i_}=.A;

else if q{_i_} = 99999999 then q{_i_}=.B;

* More conditions if needed, you can keep on going all the way through .Z;

end;

run;

Then PROC MEANS will automatically remove the missing values, but you can still differentiate one type of missing value from another.

Good luck.

Super User
Posts: 13,365