Re: Summarizing Character Variable by most frequent

TCross · Posted 07-18-2016 12:02 PM

Hello, I am trying to summarize a dataset that contains both character and numeric data. Here's a very simple version of dataset:

The data is oil and gas fields, but it has multiple entries for each field (for when a oil field has deposits at multiple depths/reservoirs).

I understand how to summarize the numeric data by mean, mode, etc., but I want to be able to summarize the character data with 'modes' or most frequent as well.

Here's what I would want as the output table in the above example:

Is there a function I could use to summarize the character data this way? The real table I would like to summarize has ~100 columns, so hopefully the solution isn't too cumbersome to use.

Thanks!

TC

ballardw · Posted 07-18-2016 12:19 PM

Procs Freq, Report and Tabulate will all count character values and combinations of character values. Using Class statements in Proc Means or summary (an likely the NWAY option) you could also get the the character variables associated with the statistics of one or more of the numerics.

Your example only shows a single value that matches an "input" so it is hard to tell what you were looking for as a result. May be a few more input rows, output and description of the rules you are looking to enforce would help provide a more complete example.

KachiM · Posted 07-18-2016 07:09 PM

Do you like to get MODE using a data step? Little programming will take you there. The solution works like this.

[1] The data set is first sorted by field_name.

[2] We process records by the BY variable using a Do-Loop(BY-Group Processing).

[3] We use one array to hold the rock_type and another array to count the number of times rock_type occurs.

[4] At the end of By-Group, the count[ ] array is searched for the largest value. Using that value, we find the name of the rock_type(rock_type_mode).

[5] The default output goes to the WANT data set.

data have;
input field_name :$9.  rock_type :$9. size;
datalines;
cantarell   carbonate  100
cantarell   carbonate  100
cantarell   clastic    100
lakach      clastic     13
lakach      clastic     13
lakach      carbonate   13
;
run;


proc sort data = have;
by field_name;
run;

data want;
   do i = 1 by 1 until(last.field_name);
      set have;
      by field_name;
      array rock[3] $9. _temporary_;
      array count[3] _temporary_;
     
      do j = 1 by 1 while(1);
         if rock[j] = rock_type then do; count[j] + 1; leave; end;
         if missing(rock[j]) then do; rock[j] = rock_type; count[j] = 1; leave; end;
      end;
      if last.field_name then do;
         largest = largest(1, of count[*]);
         indx = whichn(largest, of count[*]);
         rock_type_mode = rock[indx]; 
      end;
   end;
   call missing(of rock[*], of count[*]);
drop i j indx rock_type largest;
run;

proc print data = want;
run;

The ROCK[ ] array is sized to 3 in this case. You can change to any size, even millions - only the memory is the limitation .

Adding some more thoughts ................

I have deleted one unwanted statement ( j =1;).

Multi Mode:

It is possible to have more than one MODE in a distribution like bimode, trimode .. or multimode. The same program can be used. For instance, if you need to get bimode, get first mode using 1 in the first argument in the following statements and use 2 there to get the second mode. Similarly, 3 for trimode.

Lookup Speed:

The do - while Loop is used to search the array in a linear fashion. For large number of rock_type running to thousands search-time will be longer. There are alternatives like the use of FORMAT, Hashing, Binary Search etc. to reduce the run-time.

Do run this with your real data and share with us your experience particulary run-times. There is a SQL solution offered. Compare both the solutions.

         largest = largest(1, of count[*]);
         indx = whichn(largest, of count[*]);

Use of $9:

Here 9 is the length of the longest rock_type in this example. Replace this number by the longest rock_type of your real data. Otherwise truncated strings will not work.

Ksharp · Posted 07-19-2016 02:05 AM

SQL is good.

data have;
input field_name :$9.  rock_type :$9. size;
datalines;
cantarell   carbonate  100
cantarell   carbonate  100
cantarell   clastic    100
lakach      clastic     13
lakach      clastic     13
lakach      carbonate   13
;
run;
proc sql;

select distinct field_name,rock_type,size
 from 
  (
  select *,count(*) as n
   from have
    group by field_name,rock_type
  )
group by field_name
 having n=max(n)
 
;
quit;

Summarizing Character Variable by most frequent