BookmarkSubscribeRSS Feed
deleted_user
Not applicable
Hi, I have a sticky problem. I have a dataset with about 41,000 records, and about 6 columns (plus a key), and I need to find the modal value (mode) across the 6 columns. As far as I can tell, there's no "MODE" function to go alongside the MEDIAN and MEAN functions, and the only procedure that does it is Univariate. The problem with Univariate is that it calculates the mode for a single variable across all columns, and with 41,000 records, I can't transpose the data (besides which, running Univariate 41,000 times, then stitching the results back together doesn't seem like a good solution).

Does anyone have a good solution for this? It looks like I'll have to do some fancy code to calculate the mode, but so far I've not got much of an idea of how to start. I'd probably start with loading the data into an array, then looping through, counting how many times each value appears in the data then writing that value out.
Anyone already met and beaten this problem?
Thanks
Chris.
9 REPLIES 9
deleted_user
Not applicable
I managed to work it using 2 arrays and in the end. It doesn't feel particularly elegant, but it does the job. Excel actually has a MODE function, so that's probably the course I'll go (as it means the user will be able to do it themselves without involving me).
Anyway, here's my solution. If I had more time, and a need, I'd probably wrap it in some macro code to make it a bit more flexible, but hey ho.


data mode (drop=count1-count7 i j k compare biggest);
a = 1;
b = 4;
c = 4;
d = 6;
e = 10;
f = 12;
g = 13;

array vars {7} 4 a b c d e f g;
array count {7} 4 count1-count7;

do i = 1 to 7;
compare = vars{i};
do j = 1 to 7;
if vars{j} = compare then count{i} + 1;
end;
end;

biggest = max(of count1-count7);

do k = 1 to 7;
if count{k} = biggest then leave;
end;

mode = vars{k};

run;
deleted_user
Not applicable
There are many issues when determining a mode.

1) for discrete values:
a) as Doc says, what about multiple modes, two or three values with 2 occurrances, or 2 values with 3 occurrances?
b) what if all the values are unique?

2) data grouping:

could you have 1,2,3 10, 15, 20 as the values?, then there is no discrete mode, but there is a clumping, so the "mode" would be the 1,2,3 group, which could be expressed as either "2" or "3" (midpoint or upper boundary).

Similarly, 1, 5, 10, 15, 18,19,20, would put the "mode" at the other end of the dataset.

And 1,2,3,18,19,20 would be bimodal, after a fashion. Message was edited by: Chuck
deleted_user
Not applicable
Yeah, multimodal was something I considered as an issue, but was unable to find what the standard method for handling multi-modal data was. There's surprisingly little information about this that I could find.
I asked the guy who requested the the data in the first place and he said he was happy to just get the first mode (the data I had listed was in ascending order, however, that is unlikely to be the case in the real data, so I won't be taking the smallest mode, but the first mode in the data). If every data point is unique, it'll return the first data point.
Yeah, it's a limitation, but should work for what it's being used for in this case. (We have a bunch of census data, but population totals add up differently depending on what measure you're looking at, we just want the most likely value).

Thanks for looking in to it though.
If you have any suggestions on how this can be improved though, I'd be glad to hear them.
deleted_user
Not applicable
If you want the most likely value, that is a different question, and is somewhat answerable. But, this then assumes that each observation/record is a set of measurements of a common event. This would then demonstrate a central tendency in the numbers. The tradition of having a single number represent the set, is usually to use the appropriate mean -- arithmetic, geometric or harmonic. A lot of statistical research has also shown that the median can work better, especially for non-normal (skewed) distributions. These are simple and readily available functions.

But, if you want to create a modal analysis. Here's an idea.

Order the numbers.
Calculate the difference between each number (the distance from one number to the next).
Determine the median "distance" (if 0 then some fudge value, perhaps 1).
Use this median distance to define a bucket interval, dividing the data from lowest to highest, gouped within the bucket interval.
Count the number of items in each bucket.
Take the left edge (smallest value) of the left most "maximum" and the right edge of the right most "maximum" and determine the middle value and report that ( (l + r)/2 )
If there is only one mode, then the middle of that bucket will be reported.
If there are two modes, then the middle between them is reported.
If all the numbers are unique and evenly spaced, the middle of all the numbers is reported.

[pre]
assuming values > 0.

left = -1;
modes = 0;
do i=1 to buckets;
if bucket_size{i} = max then do;
if left < 0 then left = bucket_left{i};
right = bucket_right{i};
modes+1;
end;
end;
middle = (left + right) div 2;

[/pre]

you can now report the middle and the number of "modes". Message was edited by: Chuck
Doc_Duke
Rhodochrosite | Level 12
Be aware, if your data are multimodal, your method gets the minimum mode.
deleted_user
Not applicable
What about PROC FREQ, dump the output to table then PROC SORT by the COUNT?

Just a thought.

Ike Eisenhauer
deleted_user
Not applicable
Unfortunately Eisenhauer, the values I was needing to find the mode of were across rather than down (ie I wanted to find the mode across multiple variables one a single observation).
To use proc freq, I would have have to transpose the data first, then transpose it back after I'd done.

I did post my ultimate solution up there once I'd got it. Message was edited by: ChrisW
GodOfGambling_SAS
SAS Employee
try using transpose procedure to make ur input dataset to another form ready for use in proc univariate

Example codes for the PROC TRANSPOSE

options nodate pageno=1 linesize=80 pagesize=40;

data fishdata;
infile datalines missover;
input Location & $10. Date date7.
Length1 Weight1 Length2 Weight2 Length3 Weight3
Length4 Weight4;
format date date7.;
datalines;
Cole Pond 2JUN95 31 .25 32 .3 32 .25 33 .3
Cole Pond 3JUL95 33 .32 34 .41 37 .48 32 .28
Cole Pond 4AUG95 29 .23 30 .25 34 .47 32 .3
Eagle Lake 2JUN95 32 .35 32 .25 33 .30
Eagle Lake 3JUL95 30 .20 36 .45
Eagle Lake 4AUG95 33 .30 33 .28 34 .42
;

proc transpose data=fishdata
out=fishlength(rename=(col1=Measurement));

var length1-length4;

by location date;
run;

proc print data=fishlength noobs;
title 'Fish Length Data for Each Location and Date';
run;
GodOfGambling_SAS
SAS Employee
Another Note for you

The UNIVARIATE Procedure



Calculating the Mode
The mode is the value that occurs most often in the data. PROC UNIVARIATE counts repetitions of the values of the analysis variables or, if you specify the ROUND= option, the rounded values. If a tie occurs for the most frequent value, the procedure reports the lowest mode in the table labeled "Basic Statistical Measures" in the statistical output. To list all possible modes, use the MODES option in the PROC UNIVARIATE statement. When no repetitions occur in the data (as with truly continuous data), the procedure does not report the mode. The WEIGHT statement has no effect on the mode. See Example 3.2.

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 18602 views
  • 0 likes
  • 3 in conversation