I am having problems counting things in SAS. I have a data set with 3 vars: Level1, Level2 and FeatureNbr. I want to be able to count the number of DISTINCT FeatureNbrs for each possible value of Level1 and Level2.
SAMPLE DATA | ||
Level1 | Level2 | FeatureNbr |
Ops | SVCS | 23654 |
Ops | Forms | 23654 |
Ops | Supplies | 23654 |
Ops | Other | 23654 |
People | Salaries | 23654 |
People | Fleet | 27589 |
People | Benefits | 27589 |
People | Other | 27589 |
IT | Equipment | 27589 |
IT | Programming | 27589 |
IT | Phone | 36258 |
Infrastructure | Equipment | 36258 |
Infrastructure | Real Estate | 36258 |
Infrastructure | Other | 36258 |
My normal way to count is with proc summary but that counts each row so I found some proc sql code that does a select count(distinct(featureNbr)) on my data but it just returns one number.
ANy help would be greatly appreciated.
proc sql;
select level1, level2, count(distinct(featurenbr)) as count
from sample_data
group by level1, level2
quit;
But I am concerned why you stated that you couldn't use proc summary. In your sample data, all of the counts will equal 1 by definition. In the following, I added an extra data line so that one of the combinations would result in a value of 2:
data have;
informat Level1 $15.;
informat Level2 $15.;
input Level1 Level2 & FeatureNbr;
cards;
Ops SVCS 23654
Ops Forms 23654
Ops Supplies 23654
Ops Other 23654
People Salaries 23654
People Fleet 27589
People Benefits 27589
People Other 27589
IT Equipment 27589
IT Programming 27589
IT Programming 27589
IT Phone 36258
Infrastructure Equipment 36258
Infrastructure Real Estate 36258
Infrastructure Other 36258
;
proc summary data=have nway;
var FeatureNbr;
class level1 level2;
output out=want (drop=_:) n=;
run;
When I did a proc summary to count it was double counting because it would count each row as a new feature but as you can see there are only 3 distinct features in my sample. I had to re login to the sas server and re load my data so I haven't tired anything yet, thanks for the input.
What do you want as the output? You were concerned about getting only one number, but then you say that you only want one number for the variable FEATURE. If you look for distinct values of the variable FEATURE nested within the other two variables (see code above from FriedEgg) then only one combination has more than one distinct value of feature.
If you want the number of distinct levels for each variable then try PROC FREQ with the NLEVELS option. That should tell you that the variable FEATURE has three distinct values.
I misunderstood what you were seeking and may still not understand your request.
I would go with FriedEgg's suggested proc sql code, but make sure that you add a semicolon before the quit statement:
proc sql;
select level1, level2,
count(distinct(featurenbr)) as count
from have
group by level1, level2
;
quit;
However, that will give you a count of 1 for each unique level1*level2*featurenbr combination in your data.
I want to be able to report out on the number of unique features that occured for any given variable in my data. I am begienning to realize that this is much more difficult that I thought. I have 41 variables and zillions of rows and my boss would like me to get the data to a point where we can see how many unique features roll up to any variable in my data. So even though there is 7gb worth of rows there are only a few thousand possible places that a feature could roll up to. It looks like I will be doing 41 different proc sql like mentioned above so I can get counts for all my variables.
I think we are all miscommunicating. What output would you expect to see from your sample data?
Something along the lines of
Level2 | Count of Distinct Features | Level1 | Count of Distinct Features |
OPS | 1 | Benefits | 1 |
People | 2 | Equipment | 2 |
IT | 2 | Fleet | 1 |
Infrastructure | 1 | Forms | 1 |
Other | 3 | ||
Phone | 1 | ||
Programming | 1 | ||
Real Estate | 1 | ||
Salaries | 1 | ||
Supplies | 1 | ||
SVCS | 1 |
Still not sure, but possibly something like?:
proc sql;
select level1,
count(distinct(featurenbr)) as count
from have
group by level1
;
select level2,
count(distinct(featurenbr)) as count
from have
group by level2
;
quit;
data have;
infile cards dsd dlm='09'x;
input (level1-level2) (:$20.) featurenbr;
cards;
Ops SVCS 23654
Ops Forms 23654
Ops Supplies 23654
Ops Other 23654
People Salaries 23654
People Fleet 27589
People Benefits 27589
People Other 27589
IT Equipment 27589
IT Programming 27589
IT Phone 36258
Infrastructure Equipment 36258
Infrastructure Real Estate 36258
Infrastructure Other 36258
Ops SVCS 23654
Ops Forms 23654
Ops Supplies 23654
Ops Other 23654
People Salaries 23654
People Fleet 27589
People Benefits 27589
People Other 27589
IT Equipment 27589
IT Programming 27589
IT Phone 36258
Infrastructure Equipment 36258
Infrastructure Real Estate 36258
Infrastructure Other 36258
Ops SVCS 23654
Ops Forms 23654
Ops Supplies 23654
Ops Other 23654
People Salaries 23654
People Fleet 27589
People Benefits 27589
People Other 27589
IT Equipment 27589
IT Programming 27589
IT Phone 36258
Infrastructure Equipment 36258
Infrastructure Real Estate 36258
Infrastructure Other 36258
Ops SVCS 23654
Ops Forms 23654
Ops Supplies 23654
Ops Other 23654
People Salaries 23654
People Fleet 27589
People Benefits 27589
People Other 27589
IT Equipment 27589
IT Programming 27589
IT Phone 36258
Infrastructure Equipment 36258
Infrastructure Real Estate 36258
Infrastructure Other 36258
Infrastructure Equipment 27589
Infrastructure Real Estate 27589
Infrastructure Other 27589
;
run;
proc sql;
create view vhave as
select distinct featurenbr, level1, level2
from have;
quit;
proc summary data=vhave;
var featurenbr;
class level2 level1;
output out=want(rename=(_type_=level) where=(level>0) drop=_freq_) n=count;
run;
data level1(drop=level2) level2(drop=level1) cross;
set want;
select(level);
when(1) output level1;
when(2) output level2;
otherwise output cross;
end;
drop level;
run;
Summarize twice:
proc summary data=have missing nway;
class level1 feature;
output out=temp(drop=_:) n=;
run;
proc freq data=temp;
table level1;
run;
Repeat again for level2. Clumsy but works and generates additional data that may be useful later.
You seem to want to count the number of distinct values for a single variable (&COUNTVAR) when crossed with each of a number of other variables (&VARLIST). You can use the TYPES statement in PROC SUMMARY to generate a dataset that will make it easy to get those counts using PROC FREQ. The only problem I see is when your variables have missing values.
%let varlist=level1 level2 ;
%let countvar=featurenbr ;
proc summary data=have nway ;
class &varlist &countvar;
types (&varlist)*&countvar ;
output out=want1 ;
run;
proc freq data=want1 ;
tables &varlist ;
run;
Note that NWAY contradicts the specifications in the TYPES statement. Apparently SAS ignores NWAY in this situation.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.