turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- Base SAS Programming
- /
- How can I calculate the MODE?

Topic Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-25-2008 09:22 PM

Hi, I have a sticky problem. I have a dataset with about 41,000 records, and about 6 columns (plus a key), and I need to find the modal value (mode) across the 6 columns. As far as I can tell, there's no "MODE" function to go alongside the MEDIAN and MEAN functions, and the only procedure that does it is Univariate. The problem with Univariate is that it calculates the mode for a single variable across all columns, and with 41,000 records, I can't transpose the data (besides which, running Univariate 41,000 times, then stitching the results back together doesn't seem like a good solution).

Does anyone have a good solution for this? It looks like I'll have to do some fancy code to calculate the mode, but so far I've not got much of an idea of how to start. I'd probably start with loading the data into an array, then looping through, counting how many times each value appears in the data then writing that value out.

Anyone already met and beaten this problem?

Thanks

Chris.

Does anyone have a good solution for this? It looks like I'll have to do some fancy code to calculate the mode, but so far I've not got much of an idea of how to start. I'd probably start with loading the data into an array, then looping through, counting how many times each value appears in the data then writing that value out.

Anyone already met and beaten this problem?

Thanks

Chris.

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to deleted_user

03-26-2008 07:18 PM

I managed to work it using 2 arrays and in the end. It doesn't feel particularly elegant, but it does the job. Excel actually has a MODE function, so that's probably the course I'll go (as it means the user will be able to do it themselves without involving me).

Anyway, here's my solution. If I had more time, and a need, I'd probably wrap it in some macro code to make it a bit more flexible, but hey ho.

data mode (drop=count1-count7 i j k compare biggest);

a = 1;

b = 4;

c = 4;

d = 6;

e = 10;

f = 12;

g = 13;

array vars {7} 4 a b c d e f g;

array count {7} 4 count1-count7;

do i = 1 to 7;

compare = vars{i};

do j = 1 to 7;

if vars{j} = compare then count{i} + 1;

end;

end;

biggest = max(of count1-count7);

do k = 1 to 7;

if count{k} = biggest then leave;

end;

mode = vars{k};

run;

Anyway, here's my solution. If I had more time, and a need, I'd probably wrap it in some macro code to make it a bit more flexible, but hey ho.

data mode (drop=count1-count7 i j k compare biggest);

a = 1;

b = 4;

c = 4;

d = 6;

e = 10;

f = 12;

g = 13;

array vars {7} 4 a b c d e f g;

array count {7} 4 count1-count7;

do i = 1 to 7;

compare = vars{i};

do j = 1 to 7;

if vars{j} = compare then count{i} + 1;

end;

end;

biggest = max(of count1-count7);

do k = 1 to 7;

if count{k} = biggest then leave;

end;

mode = vars{k};

run;

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to deleted_user

03-27-2008 04:35 PM

There are many issues when determining a mode.

1) for discrete values:

a) as Doc says, what about multiple modes, two or three values with 2 occurrances, or 2 values with 3 occurrances?

b) what if all the values are unique?

2) data grouping:

could you have 1,2,3 10, 15, 20 as the values?, then there is no discrete mode, but there is a clumping, so the "mode" would be the 1,2,3 group, which could be expressed as either "2" or "3" (midpoint or upper boundary).

Similarly, 1, 5, 10, 15, 18,19,20, would put the "mode" at the other end of the dataset.

And 1,2,3,18,19,20 would be bimodal, after a fashion. Message was edited by: Chuck

1) for discrete values:

a) as Doc says, what about multiple modes, two or three values with 2 occurrances, or 2 values with 3 occurrances?

b) what if all the values are unique?

2) data grouping:

could you have 1,2,3 10, 15, 20 as the values?, then there is no discrete mode, but there is a clumping, so the "mode" would be the 1,2,3 group, which could be expressed as either "2" or "3" (midpoint or upper boundary).

Similarly, 1, 5, 10, 15, 18,19,20, would put the "mode" at the other end of the dataset.

And 1,2,3,18,19,20 would be bimodal, after a fashion. Message was edited by: Chuck

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to deleted_user

03-27-2008 06:13 PM

Yeah, multimodal was something I considered as an issue, but was unable to find what the standard method for handling multi-modal data was. There's surprisingly little information about this that I could find.

I asked the guy who requested the the data in the first place and he said he was happy to just get the first mode (the data I had listed was in ascending order, however, that is unlikely to be the case in the real data, so I won't be taking the smallest mode, but the first mode in the data). If every data point is unique, it'll return the first data point.

Yeah, it's a limitation, but should work for what it's being used for in this case. (We have a bunch of census data, but population totals add up differently depending on what measure you're looking at, we just want the most likely value).

Thanks for looking in to it though.

If you have any suggestions on how this can be improved though, I'd be glad to hear them.

I asked the guy who requested the the data in the first place and he said he was happy to just get the first mode (the data I had listed was in ascending order, however, that is unlikely to be the case in the real data, so I won't be taking the smallest mode, but the first mode in the data). If every data point is unique, it'll return the first data point.

Yeah, it's a limitation, but should work for what it's being used for in this case. (We have a bunch of census data, but population totals add up differently depending on what measure you're looking at, we just want the most likely value).

Thanks for looking in to it though.

If you have any suggestions on how this can be improved though, I'd be glad to hear them.

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to deleted_user

03-28-2008 08:59 AM

If you want the most likely value, that is a different question, and is somewhat answerable. But, this then assumes that each observation/record is a set of measurements of a common event. This would then demonstrate a central tendency in the numbers. The tradition of having a single number represent the set, is usually to use the appropriate mean -- arithmetic, geometric or harmonic. A lot of statistical research has also shown that the median can work better, especially for non-normal (skewed) distributions. These are simple and readily available functions.

But, if you want to create a modal analysis. Here's an idea.

Order the numbers.

Calculate the difference between each number (the distance from one number to the next).

Determine the median "distance" (if 0 then some fudge value, perhaps 1).

Use this median distance to define a bucket interval, dividing the data from lowest to highest, gouped within the bucket interval.

Count the number of items in each bucket.

Take the left edge (smallest value) of the left most "maximum" and the right edge of the right most "maximum" and determine the middle value and report that ( (l + r)/2 )

If there is only one mode, then the middle of that bucket will be reported.

If there are two modes, then the middle between them is reported.

If all the numbers are unique and evenly spaced, the middle of all the numbers is reported.

[pre]

assuming values > 0.

left = -1;

modes = 0;

do i=1 to buckets;

if bucket_size{i} = max then do;

if left < 0 then left = bucket_left{i};

right = bucket_right{i};

modes+1;

end;

end;

middle = (left + right) div 2;

[/pre]

you can now report the middle and the number of "modes". Message was edited by: Chuck

But, if you want to create a modal analysis. Here's an idea.

Order the numbers.

Calculate the difference between each number (the distance from one number to the next).

Determine the median "distance" (if 0 then some fudge value, perhaps 1).

Use this median distance to define a bucket interval, dividing the data from lowest to highest, gouped within the bucket interval.

Count the number of items in each bucket.

Take the left edge (smallest value) of the left most "maximum" and the right edge of the right most "maximum" and determine the middle value and report that ( (l + r)/2 )

If there is only one mode, then the middle of that bucket will be reported.

If there are two modes, then the middle between them is reported.

If all the numbers are unique and evenly spaced, the middle of all the numbers is reported.

[pre]

assuming values > 0.

left = -1;

modes = 0;

do i=1 to buckets;

if bucket_size{i} = max then do;

if left < 0 then left = bucket_left{i};

right = bucket_right{i};

modes+1;

end;

end;

middle = (left + right) div 2;

[/pre]

you can now report the middle and the number of "modes". Message was edited by: Chuck

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to deleted_user

03-27-2008 03:58 PM

Be aware, if your data are multimodal, your method gets the minimum mode.

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to deleted_user

04-05-2008 12:39 PM

What about PROC FREQ, dump the output to table then PROC SORT by the COUNT?

Just a thought.

Ike Eisenhauer

Just a thought.

Ike Eisenhauer

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to deleted_user

05-21-2008 09:31 PM

Unfortunately Eisenhauer, the values I was needing to find the mode of were across rather than down (ie I wanted to find the mode across multiple variables one a single observation).

To use proc freq, I would have have to transpose the data first, then transpose it back after I'd done.

I did post my ultimate solution up there once I'd got it. Message was edited by: ChrisW

To use proc freq, I would have have to transpose the data first, then transpose it back after I'd done.

I did post my ultimate solution up there once I'd got it. Message was edited by: ChrisW

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to deleted_user

06-25-2008 11:13 AM

try using transpose procedure to make ur input dataset to another form ready for use in proc univariate

Example codes for the PROC TRANSPOSE

options nodate pageno=1 linesize=80 pagesize=40;

data fishdata;

infile datalines missover;

input Location & $10. Date date7.

Length1 Weight1 Length2 Weight2 Length3 Weight3

Length4 Weight4;

format date date7.;

datalines;

Cole Pond 2JUN95 31 .25 32 .3 32 .25 33 .3

Cole Pond 3JUL95 33 .32 34 .41 37 .48 32 .28

Cole Pond 4AUG95 29 .23 30 .25 34 .47 32 .3

Eagle Lake 2JUN95 32 .35 32 .25 33 .30

Eagle Lake 3JUL95 30 .20 36 .45

Eagle Lake 4AUG95 33 .30 33 .28 34 .42

;

proc transpose data=fishdata

out=fishlength(rename=(col1=Measurement));

var length1-length4;

by location date;

run;

proc print data=fishlength noobs;

title 'Fish Length Data for Each Location and Date';

run;

Example codes for the PROC TRANSPOSE

options nodate pageno=1 linesize=80 pagesize=40;

data fishdata;

infile datalines missover;

input Location & $10. Date date7.

Length1 Weight1 Length2 Weight2 Length3 Weight3

Length4 Weight4;

format date date7.;

datalines;

Cole Pond 2JUN95 31 .25 32 .3 32 .25 33 .3

Cole Pond 3JUL95 33 .32 34 .41 37 .48 32 .28

Cole Pond 4AUG95 29 .23 30 .25 34 .47 32 .3

Eagle Lake 2JUN95 32 .35 32 .25 33 .30

Eagle Lake 3JUL95 30 .20 36 .45

Eagle Lake 4AUG95 33 .30 33 .28 34 .42

;

proc transpose data=fishdata

out=fishlength(rename=(col1=Measurement));

var length1-length4;

by location date;

run;

proc print data=fishlength noobs;

title 'Fish Length Data for Each Location and Date';

run;

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to GodOfGambling_SAS

06-25-2008 11:18 AM

Another Note for you

The UNIVARIATE Procedure

Calculating the Mode

The mode is the value that occurs most often in the data. PROC UNIVARIATE counts repetitions of the values of the analysis variables or, if you specify the ROUND= option, the rounded values. If a tie occurs for the most frequent value, the procedure reports the lowest mode in the table labeled "Basic Statistical Measures" in the statistical output. To list all possible modes, use the MODES option in the PROC UNIVARIATE statement. When no repetitions occur in the data (as with truly continuous data), the procedure does not report the mode. The WEIGHT statement has no effect on the mode. See Example 3.2.

The UNIVARIATE Procedure

Calculating the Mode

The mode is the value that occurs most often in the data. PROC UNIVARIATE counts repetitions of the values of the analysis variables or, if you specify the ROUND= option, the rounded values. If a tie occurs for the most frequent value, the procedure reports the lowest mode in the table labeled "Basic Statistical Measures" in the statistical output. To list all possible modes, use the MODES option in the PROC UNIVARIATE statement. When no repetitions occur in the data (as with truly continuous data), the procedure does not report the mode. The WEIGHT statement has no effect on the mode. See Example 3.2.