BookmarkSubscribeRSS Feed
TCross
Calcite | Level 5

Hello, I am trying to summarize a dataset that contains both character and numeric data. Here's a very simple version of dataset:

 example one.png

The data is oil and gas fields, but it has multiple entries for each field (for when a oil field has deposits at multiple depths/reservoirs).

 

I understand how to summarize the numeric data by mean, mode, etc., but I want to be able to summarize the character data with 'modes' or most frequent as well.

 

Here's what I would want as the output table in the above example:

example two.png

 

Is there a function I could use to summarize the character data this way? The real table I would like to summarize has ~100 columns, so hopefully the solution isn't too cumbersome to use.

 

Thanks!

TC 

 

 

3 REPLIES 3
ballardw
Super User

Procs Freq, Report and Tabulate will all count character values and combinations of character values. Using Class statements in Proc Means or summary (an likely the NWAY option) you could also get the the character variables associated with the statistics of one or more of the numerics.

 

 

Your example only shows a single value that matches an "input" so it is hard to tell what you were looking for as a result. May be a few more input rows, output and description of the rules you are looking to enforce would help provide a more complete example.

KachiM
Rhodochrosite | Level 12

Do you like to get MODE using a data step? Little programming will take you there. The solution works like this.

 

[1] The data set is first sorted by field_name.

[2] We process records by the BY variable using a Do-Loop(BY-Group Processing). 

[3] We use one array to hold the rock_type and another array to count the number of times rock_type occurs.

[4] At the end of By-Group, the count[ ] array is searched for the largest value. Using that value, we find the name of the rock_type(rock_type_mode).

[5] The default output goes to the WANT data set.

 

data have;
input field_name :$9.  rock_type :$9. size;
datalines;
cantarell   carbonate  100
cantarell   carbonate  100
cantarell   clastic    100
lakach      clastic     13
lakach      clastic     13
lakach      carbonate   13
;
run;


proc sort data = have;
by field_name;
run;

data want;
   do i = 1 by 1 until(last.field_name);
      set have;
      by field_name;
      array rock[3] $9. _temporary_;
      array count[3] _temporary_;
     
      do j = 1 by 1 while(1);
         if rock[j] = rock_type then do; count[j] + 1; leave; end;
         if missing(rock[j]) then do; rock[j] = rock_type; count[j] = 1; leave; end;
      end;
      if last.field_name then do;
         largest = largest(1, of count[*]);
         indx = whichn(largest, of count[*]);
         rock_type_mode = rock[indx]; 
      end;
   end;
   call missing(of rock[*], of count[*]);
drop i j indx rock_type largest;
run;

proc print data = want;
run;

The ROCK[ ] array is sized to 3 in this case. You can change to any size, even millions - only the memory is the limitation .

Adding some more thoughts ................

 

I have deleted one unwanted statement ( j =1;).

 

Multi Mode:

It is possible to have more than one MODE in a distribution like bimode, trimode .. or multimode. The same program can be used. For instance, if you need to get bimode, get first mode using 1 in the first argument in the following statements and use 2 there to get the second mode. Similarly, 3 for trimode.

 

Lookup Speed:

The do - while Loop is used to search the array in a linear fashion. For large number of rock_type running to thousands search-time will be longer. There are alternatives like the use of FORMAT, Hashing, Binary Search etc. to reduce the run-time.

 

Do run this with your real data and share with us your experience particulary run-times. There is a SQL solution offered. Compare both the solutions.

 

 

         largest = largest(1, of count[*]);
         indx = whichn(largest, of count[*]);

 Use of $9:

Here 9 is the length of the longest rock_type in this example. Replace this number by the longest rock_type of your real data. Otherwise truncated strings will not work.

Ksharp
Super User

SQL is good.

 

 

data have;
input field_name :$9.  rock_type :$9. size;
datalines;
cantarell   carbonate  100
cantarell   carbonate  100
cantarell   clastic    100
lakach      clastic     13
lakach      clastic     13
lakach      carbonate   13
;
run;
proc sql;

select distinct field_name,rock_type,size
 from 
  (
  select *,count(*) as n
   from have
    group by field_name,rock_type
  )
group by field_name
 having n=max(n)
 
;
quit;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 6207 views
  • 0 likes
  • 4 in conversation