hi all; may you please assist. I have a dataset where one account can be in multiple industries. How do I calculate the number of industries in a data step.
Below is the example of dataset:
Account number Industry
148060 IN
148060 CL
148060 CC
148060 HW
148060 FC
92865 PL
150021 PL
150021 MB
150021 NL
the expected outcome is:
Account number number of industries
148060 5
92865 1
150021 3
thanks ahead of time for your assistance.
@PeterClemmensen Thanks for providing test-data in an usable form 😉
Here is data-step, assuming that your source dataset is not sorted, but grouped by AccountNumber.
data want;
set have(keep=AccountNumber);
by AccountNumber notsorted;
attrib noi length= 8 label= 'number of industries';
retain noi;
if first.AccountNumber then do;
noi = 0;
end;
noi = noi + 1;
if last.AccountNumber then do;
output;
end;
run;
Why limit yourself to a data step?
data have;
input Accountnumber $ Industry $;
datalines;
148060 IN
148060 CL
148060 CC
148060 HW
148060 FC
92865 PL
150021 PL
150021 MB
150021 NL
;
proc sql;
create table want as
select Accountnumber
,count(distinct Industry) as number_of_industries
from have
group by Accountnumber;
quit;
@PeterClemmensen Thanks for providing test-data in an usable form 😉
Here is data-step, assuming that your source dataset is not sorted, but grouped by AccountNumber.
data want;
set have(keep=AccountNumber);
by AccountNumber notsorted;
attrib noi length= 8 label= 'number of industries';
retain noi;
if first.AccountNumber then do;
noi = 0;
end;
noi = noi + 1;
if last.AccountNumber then do;
output;
end;
run;
Hi @andreas_lds I'm afraid should the requirement be count of "distinct industries" for each account number, your code would yield incorrect results, like for the revised "have"
data have;
input Accountnumber $ Industry $;
datalines;
148060 IN
148060 CL
148060 CL
148060 CC
148060 HW
148060 FC
92865 PL
92865 PL
150021 PL
150021 MB
150021 NL
;
Of course a simple proc sort and nodupkey will offset the problem:
proc sort data=have out=_have nodupkey;
by Accountnumber industry;
run;
data want;
set _have(keep=AccountNumber);
by AccountNumber notsorted;
attrib noi length= 8 label= 'number of industries';
retain noi;
if first.AccountNumber then do;
noi = 0;
end;
noi = noi + 1;
if last.AccountNumber then do;
output;
end;
run;
/*Or a DOW fun*/
proc sort data=have out=_have nodupkey;
by Accountnumber industry;
run;
data want;
do number_of_industries= 1 by 1 until(last.Accountnumber);
set _have;
by Accountnumber ;
end;
drop industry;
run;
Post test data in the form of a datastep.
data have;
input Account Industry $;
datalines;
148060 IN
148060 CL
148060 CC
148060 CL
148060 HW
148060 FC
;
run;
proc freq data=have;
tables account*industry / out=want;
run;
How do I calculate the number of industries in a data step?
I just don't understand why people limit themselves to doing this in a data step, thus writing their own code; when SAS has created a procedure to do this, and added in lots amount of error checking and verification of results. See @RW9's answer.
@PaigeMiller Good morning, I am afraid RW's proc freq will not produce the results that OP wants for the reason Proc freq counts the number of occurrences of values however OP wants the total number of distinct occurrences within a by group.
This is easily fixed by doing two consecutive PROC FREQs
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.