Hello everyone,
Im a sas beginner and I am trying to count the number of unique values across a range of columns (within the same row)
For example, I have 130 columns that each contain either missing values or the names of a country and sometimes the same country name repeats. I want to create a separate variable that counts the number of unique values within the range.
e.g.:
Col1 Col2 Col3 Col4 ..... Col131 Count
U.S. U.S. . 1
Korea U.S. China 3
The example was for illustrative effect.
Thanks for your help!
Sorry, you're right. I was counting the wrong thing (getting the count for the most frequently occurring country). Luckily, what you want is easier. Here's a better version:
data want;
set have;
array col {131} col1-col131;
array new {131} $20 _temporary_;
do _n_=1 to 131;
new{_n_} = col{_n_};
end;
call sortc(of new{*});
count = (new{1} > ' ');
do _n_=2 to 131;
if new{_n_} ne new{_n_-1} then count + 1;
end;
run;
There are likely a number of ways using arrays and lots of logic to search within the row but here is an approach that reshapes the data and uses other procedures to the desired count and the merges the result back to the data.
data have;
infile datalines missover;
informat Col1 - Col3 $10.;
input Col1 Col2 Col3 ;
row = _n_;
datalines;
U.S. U.S.
Korea U.S. China
;
run;
proc transpose data=have out=trans;
by row;
var col: ;
run;
proc sql;
create table levels as
select row, count(col1) as count
from (select distinct row, col1 from trans
where not missing(col1))
group by row;
quit;
Proc sql;
create table want as
select have.*, levels.count
from have left join levels on
have.row = levels.row
;
quit;
Thanks for your quick reply Ballard...Im getting a little bit lost in the table references though... This is what I tried to come up with modeled after your code...but it's obviously not working very well cause Im having trouble following.
proc transpose data=adtr.as7 out=adtr.as8;
by year_cusip;
var nation_: ;
run;
proc sql;
create table adtr.levels as
select year_cusip, count(col1) as count
from (select distinct year_cusip, col1 from trans
where not missing(col1))
group by year_cusip;
quit;
Proc sql;
create table adtr.test as
select adtr.levels, levels.count
from adtr.levels left join levels on
adtr.levels.year_cusip = levels.row
;
quit;
Thanks for your help
@r4321 wrote:
Thanks for your quick reply Ballard...Im getting a little bit lost in the table references though... This is what I tried to come up with modeled after your code...but it's obviously not working very well cause Im having trouble following.
Proc sql;
create table adtr.test as
select adtr.levels, levels.count
from adtr.levels left join levels on
adtr.levels.year_cusip = levels.row
;
quit;
Thanks for your help
Instead of adtr.levels use the name of the base data adtr.as7 (?) and
join on levels.year_cuspid.
You should ahve gotten an error about the variable ROW not in the set Levels.
When discussing code please include log with any error messages.
data want;
set have;
array Country(*) col1-col131;
CntryCount=0;
do i=2 to 131;
if Country(i) ne . and Country(i) ne Country(i-1) then CntryCount+1;
end;
run;
Maybe start out with something like this (untested).
Thank you...
I tried this:
data adtr.as8;
set adtr.as7;
array Country(*) nation_1-nation_131;
CntryCount=0;
do i=2 to 131;
if Country(i) ne . and Country(i) ne Country(i-1) then CntryCount+1;
end;
run;
However, I get a number of errors.
sample:
NOTE: Invalid numeric data, 'France' , at line 2142 column 8.
NOTE: Invalid numeric data, 'Switzerland' , at line 2142 column 8.
NOTE: Invalid numeric data, 'United States' , at line 2142 column 8
Thanks
Here's a variation on the array approach:
data want;
set have;
array col {131} col1-col131;
array new {131} $20 _temporary_;
do _n_=1 to 131;
new{_n_} = col{_n_};
end;
call sortc(of new{*});
maxcount=0;
current_count=1;
next_country = new{1};
do _n_=2 to 131;
if next_country = new{_n_} then do;
current_count + 1;
if _n_=131 and next_country > ' ' then maxcount = max(maxcount, current_count);
end;
else do;
if next_country > ' ' then maxcount = max(maxcount, current_count);
next_country = new{_n_};
current_count=1;
if _n_=131 then maxcount = max(maxcount, 1);
end;
end;
drop current_count next_country;
run;
It's untested code so there may be a little work needed here, but it might work as is.
It seems as though the code runs.. But it doesn't quite return the desired result, the count is inaccurate.
Sorry, you're right. I was counting the wrong thing (getting the count for the most frequently occurring country). Luckily, what you want is easier. Here's a better version:
data want;
set have;
array col {131} col1-col131;
array new {131} $20 _temporary_;
do _n_=1 to 131;
new{_n_} = col{_n_};
end;
call sortc(of new{*});
count = (new{1} > ' ');
do _n_=2 to 131;
if new{_n_} ne new{_n_-1} then count + 1;
end;
run;
Awesome ...This worked great! So Im getting something out of this, would you mind checking my understanding?
data adtr.as8;
set adtr.as7;
array col {131} nation_1-nation_131; ****creates an array for nation_1-nation_131
array new {131} $20 _temporary_; ****creates an identical array for nation_1-nation_131 in order to do comparisons
do _n_=1 to 131; ***tells it go through 131 iterations
new{_n_} = col{_n_}; ***comparing the cells
end;
call sortc(of new{*}); **sorts the log to facilitate comparisons
count = (new{1} > ' '); **only count if not missing
do _n_=2 to 131; **this array tells it to count the number of unique values in the array.
if new{_n_} ne new{_n_-1} then count + 1;
end;
run;
You have the right idea. I just wanted to embellish a few points.
do _n_=1 to 131 is for copying the original variables into the temporary array.
The reason for creating a temporary array holding a copy of the variables: the values need to be sorted. This would be possible, but the wrong thing to do:
call sortc(of col{*});
It would change the values of your existing variables, which is likely a bad outcome.
count= sets up an initial value for COUNT of 0 (if the first element after sorting is missing) or 1 (if the first element after sorting is nonmissing).
do _n_=2 to 131 goes through the remaining elements in the array to see if there is a change in the value. On the next line, any change triggers the program to increment COUNT.
Hello
This code worked for me as well.Thank you
but the problem is that it counted the missing value as well. for instance, for the row like 4 5 4 . it showed 3.
how can i modify the code to solve this?
thanks a lot
It is very easy for IML. Do you like IML code ?
data have;
infile datalines missover;
informat Col1 - Col3 $10.;
input Col1 Col2 Col3 ;
datalines;
U.S. U.S.
Korea U.S. China
;
run;
data want;
if _n_=1 then do;
length k $ 80;
declare hash h();
h.definekey('k');
h.definedone();
end;
set have;
array x{*} $ col: ;
do i=1 to dim(x);
if not missing(x{i}) then do;
k=x{i};h.replace();
end;
end;
count=h.num_items;
h.clear();
drop i k;
run;
UPDATED
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.