BookmarkSubscribeRSS Feed
jaypennington
Calcite | Level 5

I have a dataset 

Id system1 system2 system3 system4

1 0001 0001 0001 0002

2 0002 0001 0002 0003

3 0003 0002 0002 0004

4 missing 0001 missing missing

 

I want to the new data set to include a new column which is the most frequent value across these systems.

Id system1 system2 system3 system4 mostfrequent countofmostfrequent

1 0001 0001 0001 0002 0001 3

2 0002 0001 0002 0003 0002 2

3 0003 0002 0002 0004 0002 2 

4 missing 0001 missing missing 0001 1

 

Any help you can provide would be greatly appreciated!

 

9 REPLIES 9
ballardw
Super User

Are these system variables numeric or character?

By "across these systems" do you mean within that observation (just to make sure I understand).

Does the ID variable repeat in the data set? If it does repeat does that affect how "most frequent" would be determined?

 

What are the rules for tie-breaking when the count for more than one value qualifies as 'most frequent': 2 values of 0002 and two values of 0003 or 4 different values out of 4 variables?

jaypennington
Calcite | Level 5

Thanks for the quick response!

The system variables are character but can convert them to numeric if it is easier. 

The is wide with one id per row and does not repeat. If there a tie breaks then pick the system that is the last value in the chain.

Quentin
Super User

If you transpose from wide to tall and skinny, you could do this with PROC FREQ or similar.

 

Even if you need to deliver wide data for some reason (e.g. Excel users), I would transform to tall and skinny in SAS, then run proc freq (or similar), and transpose back to wide when you're done with the data.

 

One lesson I learned early in my career was just because you are delivered wide data, and asked to return wide data, does not mean you're required to work with the wide data.  You're always free to change the data format while processing, to make it easier to work with in SAS (and less likely to make coding errors).  I spent a few years excited to learn arrays, multidimensional arrays, etc.  Then a brilliant boss pointed out how much simpler (and less error-prone) my code could be if I just normalized the data.

Ksharp
Super User

The most simple way is using PROC IML.

data have;
input Id (system1 system2 system3 system4) ($);
cards;
1 0001 0001 0001 0002
2 0002 0001 0002 0003
3 0003 0002 0002 0004
4 . 0001 . .
;

proc iml;
use have(keep=system:) nobs n;
read all var _all_ into x;
close;
countofmostfrequent=j(n,1,.);
do i=1 to n;
  call tabulate(level,freq,x[i,]);
  countofmostfrequent[i]=max(freq);
end;
create countofmostfrequent var{countofmostfrequent};
append;
close;
quit;
data want;
 merge have countofmostfrequent;
run;
Ksharp
Super User

If you don't have module SAS/IML , could try Hash Table.

data have;
input Id (system1 system2 system3 system4) ($);
cards;
1 0001 0001 0001 0002
2 0002 0001 0002 0003
3 0003 0002 0002 0004
4 . 0001 . .
;
data want;
 if _n_=1 then do;
   length k $ 100 n countofmostfrequent 8;
   call missing(k,n,countofmostfrequent);
   declare hash h();
   h.definekey('k');
   h.definedata('n');
   h.definedone();
 end;
set have;
h.clear();
array x{*} $ system:;
do i=1 to dim(x);
 if not missing(x{i}) then do;
   k=x{i};
   if h.find()=0 then do;n+1;h.replace(); end;
    else do;n=1;h.replace(); end; 
   countofmostfrequent=max(countofmostfrequent,n);
 end;
end;
drop k n i;
run;
jaypennington
Calcite | Level 5

This hash table works great! There is one more column that is needed. I also want to return the value the countofmostfrequent into a new column. Any help you could provide would be greatly appreciated!

jaypennington
Calcite | Level 5

Nevermind. I see that it is stored in column k. Thank you!

Ksharp
Super User

You mean the value have "countofmostfrequent" ?

 

data have;
input Id (system1 system2 system3 system4) ($);
cards;
1 0001 0001 0001 0002
2 0002 0001 0002 0003
3 0003 0002 0002 0004
4 . 0001 . .
;
data want;
 if _n_=1 then do;
   length k value $ 100 n  countofmostfrequent 8;
   call missing(k,n,value,countofmostfrequent);
   declare hash h();
   h.definekey('k');
   h.definedata('n');
   h.definedone();
 end;
set have;
h.clear();
array x{*} $ system:;
do i=1 to dim(x);
 if not missing(x{i}) then do;
   k=x{i};
   if h.find()=0 then do;n+1;h.replace(); end;
    else do;n=1;h.replace(); end; 
   if countofmostfrequent<n then do;value=k; countofmostfrequent=n;end;
 end;
end;
drop k n i;
run;
mkeintz
PROC Star

The code below uses the ability to sort the SYSTEM values (call sortc or call sortn), and then counts contiguous identical values.  If, as you said, the SYSTEM values are character, then:

 

data have;
  input Id @3 (system1 system2 system3 system4) ($4. +1);
datalines;
1 0001 0001 0001 0002
2 0002 0001 0002 0003
3 0003 0002 0002 0004
run;

data want (drop=_:);
  set have;
  call sortc(of system:);
  length _str $300;    /*Long enough to hold all the SYSTEM values */
  _str=catx(' ',of system:);

 
  mode_val='            '; 
  mode_frq=0;

  do while (_str^=' ');
    do _frq=1 by 1 until(scan(_str,_frq+1)^=scan(_str,1));
    end;
    if _frq>mode_frq then do;
      mode_val=scan(_str,1);
      mode_frq=_frq;
    end;
    _str=tranwrd(_str||' ',scan(_str,1)||' ','');
  end;
  set have;   /*Reread the same obs, so SYSTEM values are in original order*/
run;

And if the SYSTEM variables are numeric, you only have to change three lines of code  (indicated by /*!!For numeric vals!!*/ below):

 

data want (drop=_:);
  set have;
  call sortn(of system:);   /*!!For numeric vals!!*/
  length _str $300;  /*Long enough to hold all the SYSTEM values */
  _str=catx(' ',of system:);

 
  mode_val=.;  /*!!For numeric vals!!*/
  mode_frq=0;

  do while (_str^=' ');
    do _frq=1 by 1 until(scan(_str,_frq+1)^=scan(_str,1));
    end;
    if _frq>mode_frq then do;
      mode_val=input(scan(_str,1),best32.); /*!!For numeric vals!!*/
      mode_frq=_frq;
    end;
    _str=tranwrd(_str||' ',scan(_str,1)||' ','');
  end;
  set have;   /*Reread the same obs, so the SYSTEM values are in original order*/
run;

And note that, while the CALL SORTC (or CALL SORTN) destroys the original order of values for SYSTEM1 through SYSTEM4, at the end of the DATA step is a second SET HAVE, which rereads the same observation, thereby restoring SYSTEM values to original order.  This uses almost no resources, since the reread is retrieving data from cached memory, not the disk storage device.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 2386 views
  • 1 like
  • 5 in conversation