Re: Help counting the most frequent value across multiple columns and ...

jaypennington · Posted 06-11-2024 06:10 PM

I have a dataset

Id system1 system2 system3 system4

1 0001 0001 0001 0002

2 0002 0001 0002 0003

3 0003 0002 0002 0004

4 missing 0001 missing missing

I want to the new data set to include a new column which is the most frequent value across these systems.

Id system1 system2 system3 system4 mostfrequent countofmostfrequent

1 0001 0001 0001 0002 0001 3

2 0002 0001 0002 0003 0002 2

3 0003 0002 0002 0004 0002 2

4 missing 0001 missing missing 0001 1

Any help you can provide would be greatly appreciated!

ballardw · Posted 06-11-2024 06:15 PM

Are these system variables numeric or character?

By "across these systems" do you mean within that observation (just to make sure I understand).

Does the ID variable repeat in the data set? If it does repeat does that affect how "most frequent" would be determined?

What are the rules for tie-breaking when the count for more than one value qualifies as 'most frequent': 2 values of 0002 and two values of 0003 or 4 different values out of 4 variables?

jaypennington · Posted 06-11-2024 07:04 PM

Thanks for the quick response!

The system variables are character but can convert them to numeric if it is easier.

The is wide with one id per row and does not repeat. If there a tie breaks then pick the system that is the last value in the chain.

Quentin · Posted 06-11-2024 07:18 PM

If you transpose from wide to tall and skinny, you could do this with PROC FREQ or similar.

Even if you need to deliver wide data for some reason (e.g. Excel users), I would transform to tall and skinny in SAS, then run proc freq (or similar), and transpose back to wide when you're done with the data.

One lesson I learned early in my career was just because you are delivered wide data, and asked to return wide data, does not mean you're required to work with the wide data. You're always free to change the data format while processing, to make it easier to work with in SAS (and less likely to make coding errors). I spent a few years excited to learn arrays, multidimensional arrays, etc. Then a brilliant boss pointed out how much simpler (and less error-prone) my code could be if I just normalized the data.

The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.

Ksharp · Posted 06-11-2024 09:15 PM

The most simple way is using PROC IML.

data have;
input Id (system1 system2 system3 system4) ($);
cards;
1 0001 0001 0001 0002
2 0002 0001 0002 0003
3 0003 0002 0002 0004
4 . 0001 . .
;

proc iml;
use have(keep=system:) nobs n;
read all var _all_ into x;
close;
countofmostfrequent=j(n,1,.);
do i=1 to n;
  call tabulate(level,freq,x[i,]);
  countofmostfrequent[i]=max(freq);
end;
create countofmostfrequent var{countofmostfrequent};
append;
close;
quit;
data want;
 merge have countofmostfrequent;
run;

Ksharp · Posted 06-11-2024 09:47 PM

If you don't have module SAS/IML , could try Hash Table.

data have;
input Id (system1 system2 system3 system4) ($);
cards;
1 0001 0001 0001 0002
2 0002 0001 0002 0003
3 0003 0002 0002 0004
4 . 0001 . .
;
data want;
 if _n_=1 then do;
   length k $ 100 n countofmostfrequent 8;
   call missing(k,n,countofmostfrequent);
   declare hash h();
   h.definekey('k');
   h.definedata('n');
   h.definedone();
 end;
set have;
h.clear();
array x{*} $ system:;
do i=1 to dim(x);
 if not missing(x{i}) then do;
   k=x{i};
   if h.find()=0 then do;n+1;h.replace(); end;
    else do;n=1;h.replace(); end; 
   countofmostfrequent=max(countofmostfrequent,n);
 end;
end;
drop k n i;
run;

jaypennington · Posted 06-16-2024 01:20 PM

This hash table works great! There is one more column that is needed. I also want to return the value the countofmostfrequent into a new column. Any help you could provide would be greatly appreciated!

jaypennington · Posted 06-16-2024 01:25 PM

Nevermind. I see that it is stored in column k. Thank you!

Ksharp · Posted 06-16-2024 09:02 PM

You mean the value have "countofmostfrequent" ?

data have;
input Id (system1 system2 system3 system4) ($);
cards;
1 0001 0001 0001 0002
2 0002 0001 0002 0003
3 0003 0002 0002 0004
4 . 0001 . .
;
data want;
 if _n_=1 then do;
   length k value $ 100 n  countofmostfrequent 8;
   call missing(k,n,value,countofmostfrequent);
   declare hash h();
   h.definekey('k');
   h.definedata('n');
   h.definedone();
 end;
set have;
h.clear();
array x{*} $ system:;
do i=1 to dim(x);
 if not missing(x{i}) then do;
   k=x{i};
   if h.find()=0 then do;n+1;h.replace(); end;
    else do;n=1;h.replace(); end; 
   if countofmostfrequent<n then do;value=k; countofmostfrequent=n;end;
 end;
end;
drop k n i;
run;

mkeintz · Posted 06-11-2024 11:55 PM

The code below uses the ability to sort the SYSTEM values (call sortc or call sortn), and then counts contiguous identical values. If, as you said, the SYSTEM values are character, then:

data have;
  input Id @3 (system1 system2 system3 system4) ($4. +1);
datalines;
1 0001 0001 0001 0002
2 0002 0001 0002 0003
3 0003 0002 0002 0004
run;

data want (drop=_:);
  set have;
  call sortc(of system:);
  length _str $300;    /*Long enough to hold all the SYSTEM values */
  _str=catx(' ',of system:);

 
  mode_val='            '; 
  mode_frq=0;

  do while (_str^=' ');
    do _frq=1 by 1 until(scan(_str,_frq+1)^=scan(_str,1));
    end;
    if _frq>mode_frq then do;
      mode_val=scan(_str,1);
      mode_frq=_frq;
    end;
    _str=tranwrd(_str||' ',scan(_str,1)||' ','');
  end;
  set have;   /*Reread the same obs, so SYSTEM values are in original order*/
run;

And if the SYSTEM variables are numeric, you only have to change three lines of code (indicated by /*!!For numeric vals!!*/ below):

data want (drop=_:);
  set have;
  call sortn(of system:);   /*!!For numeric vals!!*/
  length _str $300;  /*Long enough to hold all the SYSTEM values */
  _str=catx(' ',of system:);

 
  mode_val=.;  /*!!For numeric vals!!*/
  mode_frq=0;

  do while (_str^=' ');
    do _frq=1 by 1 until(scan(_str,_frq+1)^=scan(_str,1));
    end;
    if _frq>mode_frq then do;
      mode_val=input(scan(_str,1),best32.); /*!!For numeric vals!!*/
      mode_frq=_frq;
    end;
    _str=tranwrd(_str||' ',scan(_str,1)||' ','');
  end;
  set have;   /*Reread the same obs, so the SYSTEM values are in original order*/
run;

And note that, while the CALL SORTC (or CALL SORTN) destroys the original order of values for SYSTEM1 through SYSTEM4, at the end of the DATA step is a second SET HAVE, which rereads the same observation, thereby restoring SYSTEM values to original order. This uses almost no resources, since the reread is retrieving data from cached memory, not the disk storage device.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Help counting the most frequent value across multiple columns and return that value to a new column.

Re: Help counting the most frequent value across multiple columns and return that value to a new col

Re: Help counting the most frequent value across multiple columns and return that value to a new col

Re: Help counting the most frequent value across multiple columns and return that value to a new col

Re: Help counting the most frequent value across multiple columns and return that value to a new col

Re: Help counting the most frequent value across multiple columns and return that value to a new col

Re: Help counting the most frequent value across multiple columns and return that value to a new col

Re: Help counting the most frequent value across multiple columns and return that value to a new col

Re: Help counting the most frequent value across multiple columns and return that value to a new col

Re: Help counting the most frequent value across multiple columns and return that value to a new col

SAS Innovate 2025: Call for Content

Classroom Training Available!