How to run means on many variables upon their conditions (subset selec...

NonSleeper · Posted 07-24-2015 12:33 AM

I want to run proc means on several continuous variables, for example Var1, Var2, Var3, but want to leave out certain observations because their values are not appropriate for analysis. For example, some observations/respondents may receive code of 99999999 on any, some, or all of these variables if those respondents were not asked such questions on the survey, like men generally aren't asked how much they spend each month on lotions (well, some surveys may ask that question and I've got no problem with that, but just to say).

It can also be the case where there are more than one values that we wish to leave out. For example, some observations receive values of 88888888 if they answered "Don't know". So these are not missing values as they did answer, but at the same time not legitimate for calculating (means) either.

If it's one variable, e.g., Var1, then I can do:

proc means data=have;

where Var1 not in (88888888, 99999999);

var Var1;

run;

But I'm not sure how to make it efficient in the case of multiple variables. I think of a two-step solution but then figure out it won't work, like:

data have; set have;

Var4=0;

array myarr

Var1 Var2 Var3;

do i = 1 to 3;

if myarr(i) in (88888888, 99999999) then Var4=1;

end; drop i; run;

proc means data=have;

where Var4 ne 1;

var Var1 Var2 Var3;

run;

It won't work because the array assigns 1 to Var4 if any from Var1 - Var3 receives one of these values, thus conditioning on Var4 inappropriately leaves out legitimate observations of variables.

ballardw · Posted 07-24-2015 11:13 AM

How many variables? How many combinations of values to exclude. If the values you are excluding are in general invalid or a code to indicate specific forms of missing data it may be easiest in the long run to create new variables where those values are set to missing.

Retaining the original variable allows a double check on any raw values. OR delve into to the world of missing indicated as .A, .B and such and custom formats to display those as "Not asked" "Don't Know" or similar.

data want; /* I am very leery of using the data have; set have; when recoding variables as an error may go undetected and you may not be able to determine HOW that error occurred later*/

set have;

array myarr Var1 Var2 Var3;

array mynewvar NewVar1 NewVar2 NewVar3;

do i = 1 to dim (myvar);

if myarr(i) in (88888888, 99999999) then MyNewVar(I) = . ;

else MyNewVar(I) = myarr(I);

end;

drop i;

run;

Group the variables of similar coding together and assign missing.

Then proc means, summary and surveymeans, surveyfreq will give you pretty much intended results when asking for the statistics on the NewVar variables.

The other approach involves creating custom informats and formats for use when reading.

Here is a brief example:

proc format library=work;
invalue xmiss
888888 = .A
999999 = .B
other = _same_;
;
value xmiss
.A= "Don't Know"
.B= "Not Asked"
;

run;

data junk;
   informat x xmiss.;
   input x;
   format x xmiss.;
datalines ;
888888
1
2
3
4
999999
;
run;

proc means data=junk;
var x;
format x xmiss.;
run;

Reeza · Posted 07-24-2015 11:24 AM

The most efficient method is to clean your data first overall, I'd also be cautious about using your where clause like that in your Proc Means.

Depending on how you plan to do your analysis in the future saying (where Var1 not in (88888888, 99999999);) then excludes that observation for all other variables. So if a male hasn't been asked how much he's spent on cosmetics that would also exclude varX which might be how much did someone spend on takeout.

PS I have some male friends, straight and gay, that spend more on cosmetics than I ever do...

Astounding · Posted 07-24-2015 11:57 AM

Borrowing ideas from both of the responses you have gotten so far ...

It was a mistake to store values of 88888888 and 99999999. You are better off fixing that mistake first rather than trying to code around it. The fix is relatively straightforward:

data want;

set have;

array q {500} var1-var500;

do _i_=1 to 500;

if q{_i_} = 88888888 then q{_i_}=.A;

else if q{_i_} = 99999999 then q{_i_}=.B;

* More conditions if needed, you can keep on going all the way through .Z;

end;

run;

Then PROC MEANS will automatically remove the missing values, but you can still differentiate one type of missing value from another.

Good luck.

ballardw · Posted 07-24-2015 12:42 PM

Astounding;

I agree that it would be a mistake to intentionally set values as mentioned. However the data source may have been a survey data collection organization of some sort and the OP received the data in this fashion. The coding of the individual questions may need to be looked at closely as different questions may use the same code differently, for example 88888888 as a "Not answered" in one question and "Not applicable" in another question. Hence my comment on grouping variables that were coded similarly for treatment.

As always, know your data before you start really helps.

How to run means on many variables upon their conditions (subset selection)?

Re: How to run means on many variables upon their conditions (subset selection)?

Re: How to run means on many variables upon their conditions (subset selection)?

Re: How to run means on many variables upon their conditions (subset selection)?

Re: How to run means on many variables upon their conditions (subset selection)?