tricky subsetting problem

jdub · Posted 09-12-2011 04:13 PM

Howdy

I have a file of employment data corresponding to 9 counties that I wish to aggregate across counties by NAICS. Within the file are employment observations for many different NAICS for each county. I wish to aggregate the employment obs. over all the counties by NAICS. I can do this with proc means or proc summary.

There is a problem though.

Some of the counties have observations that are missing. If I use proc means to aggregate the data, it treats the missing values as zeros. I would like to output an aggregated dataset for only those NAICS that have employment values for all nine counties. If a single county has a missing value I do not want to aggregate the employment observations. My thought is to subset the original data so that it contains data on NAICS with observed values for all nine counties. I know how to drop observations if a observation is missing but I do not know how to drop an entire class (NAICS) if one or more county observations in that class are missing.

Is there some exceptional SAS-pert that can help me?

art297 · Posted 09-12-2011 04:28 PM

The easiest way I can think of is to use proc sql to either pre-process your file and add a field that is a count of the NAICS for the particular county and then use that field in a where statement as part of your proc summary, or simply do the entire analysis in sql using the count in a having statement.

If you need an example, post some code for at least one field for a couple of counties, one that has data for all nine and one that doesn't.

jdub · Posted 09-12-2011 05:02 PM

Hi art297

An example would help me out greatly. Here is a snipet of data:

Here I have one missine observation for employment in county 1 for naics 442. I am trying to generate a dataset that strips all observations for naics 442 from the original set. In the original data I have many naics with and without missing observations across the nine counties.

county	naics3	Emp
1	442
1	447	43
41	447	34
67	442	12
67	447	80
69	442	17
69	447	208
73	442	290
73	447	531
85	442	80
85	447	122
97	442	109
97	447	294
125	442	25
125	447	144
141	442	76
141	447	407

DLing · Posted 09-12-2011 04:36 PM

Why don't you use proc means to first create a dataset with the number of records with non-missing values, # with missing values, and the total, i.e., output N, NMISS and SUM, then you can select only those with N=9 or NMISS=0. Something like:

proc summary data=a missing nway;
class naics;
var employ;

     output out=summary(drop=_:)
          n=n    nmiss=nmiss    sum=total
     ;
run;

You can then check if nmiss is 0 or not.

Peter_C · Posted 09-20-2011 08:02 AM

Dling has the top solution, on which it is hard to imporove, except perhaps applying the "check" as the table is written, like:

output out=summary(drop=_: where=( not miss ) )

art297 · Posted 09-12-2011 05:38 PM

Here is an example of a proc sql solution:

data have;

input county naics3 Emp;

cards;

1 442 .

1 447 43

41 447 34

67 442 12

67 447 80

69 442 17

69 447 208

73 442 290

73 447 531

85 442 80

85 447 122

97 442 109

97 447 294

125 442 25

125 447 144

141 442 76

141 447 407

;

proc sql noprint;

create table want as

select naics3,count(emp) as count,sum(emp) as total_emp

from have

group by naics3

having count(emp) eq 9

;

quit;

Fisher · Posted 09-12-2011 06:27 PM

proc sql;

create table want as

select * from have

where naics3 not in (select distinct naics3 from have where emp is missing);

quit;