Solved: Re: count unique variable values across columns (within same row)

r4321 · Posted 05-18-2016 11:10 AM

Hello everyone,

Im a sas beginner and I am trying to count the number of unique values across a range of columns (within the same row)

For example, I have 130 columns that each contain either missing values or the names of a country and sometimes the same country name repeats. I want to create a separate variable that counts the number of unique values within the range.

e.g.:

Col1 Col2 Col3 Col4 ..... Col131 Count

U.S. U.S. . 1

Korea U.S. China 3

The example was for illustrative effect.

Thanks for your help!

Astounding · Posted 05-18-2016 12:42 PM

Sorry, you're right. I was counting the wrong thing (getting the count for the most frequently occurring country). Luckily, what you want is easier. Here's a better version:

data want;

set have;

array col {131} col1-col131;

array new {131} $20 _temporary_;

do _n_=1 to 131;

new{_n_} = col{_n_};

end;

call sortc(of new{*});

count = (new{1} > ' ');

do _n_=2 to 131;

if new{_n_} ne new{_n_-1} then count + 1;

end;

run;

View solution in original post

ballardw · Posted 05-18-2016 11:43 AM

There are likely a number of ways using arrays and lots of logic to search within the row but here is an approach that reshapes the data and uses other procedures to the desired count and the merges the result back to the data.

data have;
   infile datalines missover;
   informat Col1 - Col3 $10.;
   input Col1      Col2       Col3   ;
   row = _n_;
datalines;
  U.S.      U.S.               
  Korea     U.S.        China    
;
run;

proc transpose data=have out=trans;
   by row;
   var col: ;
run;

proc sql;
   create table levels as
   select row, count(col1) as count
   from (select distinct row, col1 from trans
         where not missing(col1))
   group by row;
quit;

Proc sql;
   create table want as 
   select have.*, levels.count
   from have left join levels on 
      have.row = levels.row
   ;
quit;

r4321 · Posted 05-18-2016 11:58 AM

Thanks for your quick reply Ballard...Im getting a little bit lost in the table references though... This is what I tried to come up with modeled after your code...but it's obviously not working very well cause Im having trouble following.

proc transpose data=adtr.as7 out=adtr.as8;
by year_cusip;
var nation_: ;
run;

proc sql;
create table adtr.levels as
select year_cusip, count(col1) as count
from (select distinct year_cusip, col1 from trans
where not missing(col1))
group by year_cusip;
quit;

Proc sql;
create table adtr.test as
select adtr.levels, levels.count
from adtr.levels left join levels on
adtr.levels.year_cusip = levels.row
;
quit;

Thanks for your help

ballardw · Posted 05-18-2016 01:00 PM

@r4321 wrote:

Thanks for your quick reply Ballard...Im getting a little bit lost in the table references though... This is what I tried to come up with modeled after your code...but it's obviously not working very well cause Im having trouble following.

Proc sql;
create table adtr.test as
select adtr.levels, levels.count
from adtr.levels left join levels on
adtr.levels.year_cusip = levels.row
;
quit;

Thanks for your help

Instead of adtr.levels use the name of the base data adtr.as7 (?) and

join on levels.year_cuspid.

You should ahve gotten an error about the variable ROW not in the set Levels.

When discussing code please include log with any error messages.

JediApprentice · Posted 05-18-2016 11:48 AM

data want;
  set have;
  array Country(*) col1-col131;
  CntryCount=0;
  do i=2 to 131;
    if Country(i) ne . and Country(i) ne Country(i-1) then CntryCount+1;
  end;
run;

Maybe start out with something like this (untested).

r4321 · Posted 05-18-2016 12:02 PM

Thank you...

I tried this:

data adtr.as8;
set adtr.as7;
array Country(*) nation_1-nation_131;
CntryCount=0;
do i=2 to 131;
if Country(i) ne . and Country(i) ne Country(i-1) then CntryCount+1;
end;
run;

However, I get a number of errors.

sample:

NOTE: Invalid numeric data, 'France' , at line 2142 column 8.
NOTE: Invalid numeric data, 'Switzerland' , at line 2142 column 8.
NOTE: Invalid numeric data, 'United States' , at line 2142 column 8

Thanks

Astounding · Posted 05-18-2016 12:10 PM

Here's a variation on the array approach:

data want;

set have;

array col {131} col1-col131;

array new {131} $20 _temporary_;

do _n_=1 to 131;

new{_n_} = col{_n_};

end;

call sortc(of new{*});

maxcount=0;

current_count=1;

next_country = new{1};

do _n_=2 to 131;

if next_country = new{_n_} then do;

current_count + 1;

if _n_=131 and next_country > ' ' then maxcount = max(maxcount, current_count);

end;

else do;

if next_country > ' ' then maxcount = max(maxcount, current_count);

next_country = new{_n_};

current_count=1;

if _n_=131 then maxcount = max(maxcount, 1);

end;

drop current_count next_country;

run;

It's untested code so there may be a little work needed here, but it might work as is.

r4321 · Posted 05-18-2016 12:29 PM

It seems as though the code runs.. But it doesn't quite return the desired result, the count is inaccurate.

r4321 · Posted 05-18-2016 12:30 PM

Astounding · Posted 05-18-2016 12:42 PM

Sorry, you're right. I was counting the wrong thing (getting the count for the most frequently occurring country). Luckily, what you want is easier. Here's a better version:

data want;

set have;

array col {131} col1-col131;

array new {131} $20 _temporary_;

do _n_=1 to 131;

new{_n_} = col{_n_};

end;

call sortc(of new{*});

count = (new{1} > ' ');

do _n_=2 to 131;

if new{_n_} ne new{_n_-1} then count + 1;

end;

run;

r4321 · Posted 05-18-2016 12:56 PM

Awesome ...This worked great! So Im getting something out of this, would you mind checking my understanding?

data adtr.as8;
set adtr.as7;
array col {131} nation_1-nation_131; ****creates an array for nation_1-nation_131
array new {131} $20 _temporary_; ****creates an identical array for nation_1-nation_131 in order to do comparisons
do _n_=1 to 131; ***tells it go through 131 iterations
new{_n_} = col{_n_}; ***comparing the cells
end;
call sortc(of new{*}); **sorts the log to facilitate comparisons
count = (new{1} > ' '); **only count if not missing
do _n_=2 to 131; **this array tells it to count the number of unique values in the array.
if new{_n_} ne new{_n_-1} then count + 1;
end;
run;

Astounding · Posted 05-18-2016 01:17 PM

You have the right idea. I just wanted to embellish a few points.

do _n_=1 to 131 is for copying the original variables into the temporary array.

The reason for creating a temporary array holding a copy of the variables: the values need to be sorted. This would be possible, but the wrong thing to do:

call sortc(of col{*});

It would change the values of your existing variables, which is likely a bad outcome.

count= sets up an initial value for COUNT of 0 (if the first element after sorting is missing) or 1 (if the first element after sorting is nonmissing).

do _n_=2 to 131 goes through the remaining elements in the array to see if there is a change in the value. On the next line, any change triggers the program to increment COUNT.

fama · Posted 06-13-2017 03:49 AM

Hello

This code worked for me as well.Thank you

but the problem is that it counted the missing value as well. for instance, for the row like 4 5 4 . it showed 3.

how can i modify the code to solve this?

thanks a lot

Ksharp · Posted 05-18-2016 10:22 PM

It is very easy for IML. Do you like IML code ?

data have;
   infile datalines missover;
   informat Col1 - Col3 $10.;
   input Col1      Col2       Col3   ;
datalines;
  U.S.      U.S.               
  Korea     U.S.        China    
;
run;
data want;
 if _n_=1 then do;
  length k $ 80;
  declare hash h();
  h.definekey('k');
  h.definedone();
 end;
 set have;
 array x{*} $ col: ;
 do i=1 to dim(x);
  if not missing(x{i}) then do;
   k=x{i};h.replace();
  end;
 end;
 count=h.num_items;
 h.clear();
 drop i k;
run;
 UPDATED

Registration is open