Counting Distinct Entries by Variables

1800bigk · Posted 09-23-2011 01:29 PM

I am having problems counting things in SAS. I have a data set with 3 vars: Level1, Level2 and FeatureNbr. I want to be able to count the number of DISTINCT FeatureNbrs for each possible value of Level1 and Level2.

SAMPLE DATA
Level1	Level2	FeatureNbr
Ops	SVCS	23654
Ops	Forms	23654
Ops	Supplies	23654
Ops	Other	23654
People	Salaries	23654
People	Fleet	27589
People	Benefits	27589
People	Other	27589
IT	Equipment	27589
IT	Programming	27589
IT	Phone	36258
Infrastructure	Equipment	36258
Infrastructure	Real Estate	36258
Infrastructure	Other	36258

My normal way to count is with proc summary but that counts each row so I found some proc sql code that does a select count(distinct(featureNbr)) on my data but it just returns one number.

ANy help would be greatly appreciated.

FriedEgg · Posted 09-23-2011 01:32 PM

proc sql;

select level1, level2, count(distinct(featurenbr)) as count

from sample_data

group by level1, level2

quit;

art297 · Posted 09-23-2011 01:47 PM

But I am concerned why you stated that you couldn't use proc summary. In your sample data, all of the counts will equal 1 by definition. In the following, I added an extra data line so that one of the combinations would result in a value of 2:

data have;

informat Level1 $15.;

informat Level2 $15.;

input Level1 Level2 & FeatureNbr;

cards;

Ops SVCS 23654

Ops Forms 23654

Ops Supplies 23654

Ops Other 23654

People Salaries 23654

People Fleet 27589

People Benefits 27589

People Other 27589

IT Equipment 27589

IT Programming 27589

IT Phone 36258

Infrastructure Equipment 36258

Infrastructure Real Estate 36258

Infrastructure Other 36258

;

proc summary data=have nway;

var FeatureNbr;

class level1 level2;

output out=want (drop=_:) n=;

run;

1800bigk · Posted 09-23-2011 02:16 PM

When I did a proc summary to count it was double counting because it would count each row as a new feature but as you can see there are only 3 distinct features in my sample. I had to re login to the sas server and re load my data so I haven't tired anything yet, thanks for the input.

Tom · Posted 09-23-2011 02:27 PM

What do you want as the output? You were concerned about getting only one number, but then you say that you only want one number for the variable FEATURE. If you look for distinct values of the variable FEATURE nested within the other two variables (see code above from FriedEgg) then only one combination has more than one distinct value of feature.

If you want the number of distinct levels for each variable then try PROC FREQ with the NLEVELS option. That should tell you that the variable FEATURE has three distinct values.

art297 · Posted 09-23-2011 02:29 PM

I misunderstood what you were seeking and may still not understand your request.

I would go with FriedEgg's suggested proc sql code, but make sure that you add a semicolon before the quit statement:

proc sql;

select level1, level2,

count(distinct(featurenbr)) as count

from have

group by level1, level2

;

quit;

However, that will give you a count of 1 for each unique level1*level2*featurenbr combination in your data.

1800bigk · Posted 09-23-2011 02:39 PM

I want to be able to report out on the number of unique features that occured for any given variable in my data. I am begienning to realize that this is much more difficult that I thought. I have 41 variables and zillions of rows and my boss would like me to get the data to a point where we can see how many unique features roll up to any variable in my data. So even though there is 7gb worth of rows there are only a few thousand possible places that a feature could roll up to. It looks like I will be doing 41 different proc sql like mentioned above so I can get counts for all my variables.

art297 · Posted 09-23-2011 02:46 PM

I think we are all miscommunicating. What output would you expect to see from your sample data?

1800bigk · Posted 09-23-2011 02:55 PM

Something along the lines of

Level2	Count of Distinct Features	Level1	Count of Distinct Features
OPS	1	Benefits	1
People	2	Equipment	2
IT	2	Fleet	1
Infrastructure	1	Forms	1
Other	3
Phone	1
Programming	1
Real Estate	1
Salaries	1
Supplies	1
SVCS	1

art297 · Posted 09-23-2011 03:02 PM

Still not sure, but possibly something like?:

proc sql;

select level1,

count(distinct(featurenbr)) as count

from have

group by level1

;

select level2,

count(distinct(featurenbr)) as count

from have

group by level2

;

quit;

FriedEgg · Posted 09-23-2011 04:59 PM

data have;

infile cards dsd dlm='09'x;

input (level1-level2) (:$20.) featurenbr;

cards;

Ops SVCS 23654

Ops Forms 23654

Ops Supplies 23654

Ops Other 23654

People Salaries 23654

People Fleet 27589

People Benefits 27589

People Other 27589

IT Equipment 27589

IT Programming 27589

IT Phone 36258

Infrastructure Equipment 36258

Infrastructure Real Estate 36258

Infrastructure Other 36258

Ops SVCS 23654

Ops Forms 23654

Ops Supplies 23654

Ops Other 23654

People Salaries 23654

People Fleet 27589

People Benefits 27589

People Other 27589

IT Equipment 27589

IT Programming 27589

IT Phone 36258

Infrastructure Equipment 36258

Infrastructure Real Estate 36258

Infrastructure Other 36258

Ops SVCS 23654

Ops Forms 23654

Ops Supplies 23654

Ops Other 23654

People Salaries 23654

People Fleet 27589

People Benefits 27589

People Other 27589

IT Equipment 27589

IT Programming 27589

IT Phone 36258

Infrastructure Equipment 36258

Infrastructure Real Estate 36258

Infrastructure Other 36258

Ops SVCS 23654

Ops Forms 23654

Ops Supplies 23654

Ops Other 23654

People Salaries 23654

People Fleet 27589

People Benefits 27589

People Other 27589

IT Equipment 27589

IT Programming 27589

IT Phone 36258

Infrastructure Equipment 36258

Infrastructure Real Estate 36258

Infrastructure Other 36258

Infrastructure Equipment 27589

Infrastructure Real Estate 27589

Infrastructure Other 27589

;

run;

proc sql;

create view vhave as

select distinct featurenbr, level1, level2

from have;

quit;

proc summary data=vhave;

var featurenbr;

class level2 level1;

output out=want(rename=(_type_=level) where=(level>0) drop=_freq_) n=count;

run;

data level1(drop=level2) level2(drop=level1) cross;

set want;

select(level);

when(1) output level1;

when(2) output level2;

otherwise output cross;

end;

drop level;

run;

DLing · Posted 09-23-2011 04:59 PM

Summarize twice:

proc summary data=have missing nway;

class level1 feature;

output out=temp(drop=_:) n=;

run;

proc freq data=temp;

table level1;

run;

Repeat again for level2. Clumsy but works and generates additional data that may be useful later.

Tom · Posted 09-25-2011 05:29 PM

You seem to want to count the number of distinct values for a single variable (&COUNTVAR) when crossed with each of a number of other variables (&VARLIST). You can use the TYPES statement in PROC SUMMARY to generate a dataset that will make it easy to get those counts using PROC FREQ. The only problem I see is when your variables have missing values.

%let varlist=level1 level2 ;

%let countvar=featurenbr ;

proc summary data=have nway ;

class &varlist &countvar;

types (&varlist)*&countvar ;

output out=want1 ;

run;

proc freq data=want1 ;

tables &varlist ;

run;

Howles · Posted 09-26-2011 09:23 AM

Note that NWAY contradicts the specifications in the TYPES statement. Apparently SAS ignores NWAY in this situation.