Re: Mode of nominal variables

catkat96 · Posted 03-19-2021 02:44 PM

Hello, I basically need to find out how to calculate the mode / the most frequent value from a column of strings, not numbers.

This is the data I have: many customers for many days

data Have;
	input ID callDate :ddmmyy. employment;
	format callDate ddmmyy10.;
	datalines;
1 15/02/2021 full-time
1 17/02/2021 full-time
1 19/02/2021 full-time
2 10/02/2021 part-time
2 15/02/2021 part-time
3 17/02/2021 full-time
3 19/02/2021 full-time
3 20/02/2021 full-time
4 08/02/2021 full-time
4 10/02/2021 full-time
4 13/02/2021 full-time
4 15/02/2021 full-time
4 17/02/2021 full-time
4 19/02/2021 full-time
5 10/02/2021 unemployed
;
run;

I am creating a table of statistics for each day of my table, where I count some variables and I average others. I owuld be interested in getting the mode per each day. The basic structre of this code is the following:

proc sql;
create table daily_stats as
select calldate
, count( distinct id) as id_number
, *insert mode code here*

from have
group by calldate;
quit;

It would have to look like this:

data daily_stats;
	input callDate :ddmmyy. id_number mode;
	format callDate ddmmyy10.;
	datalines;
08/02/2021 1 full-time
13/02/2021 1 full-time
10/02/2021 3 (to be fair i don't know what the answer would be in this case)
15/02/2021 3 full-time
17/02/2021 3 full-time
19/02/2021 3 full-time
20/02/2021 1 full-time
;
run;

If you can think of an answer in PROC SQL it would be great. Thanks!

FreelanceReinh · Posted 03-19-2021 07:31 PM

Hello @catkat96,

Here's one suggestion:

proc sql;
create table daily_stats as
select calldate
, count(distinct id) as id_number
, (select max(e) from
    (select e from
      (select employment as e, count(*) as c
       from have b
       where b.calldate=a.calldate
       group by e)
     having c=max(c))) as mode
from have a
group by calldate;
quit;

The innermost query creates a frequency table of the EMPLOYMENT values within the current CALLDATE value ("BY group").
Then the EMPLOYMENT value(s) -- plural in the case of ties -- having the maximum frequency are selected.
From this list of tied distinct EMPLOYMENT value(s) you can select what you like:
- my suggestion max(e), i.e., the last value in alphabetical order
- or min(e), i.e., the first value in alphabetical order
- or something else (e.g., by using a HAVING clause on that nesting level).

catkat96 · Posted 03-24-2021 08:23 AM

That is really great, it works fine!
I do wonder if you can think of a way to get many modes at the same time in a compressed manner? I've been copying this code for each mode I need to calculate but I have to do a few so it became very long, and maybe you have an idea 🙂

FreelanceReinh · Posted 03-24-2021 09:04 AM

Glad to hear that it works.

So you have more categorical variables like employment. Here's an example including a second categorical variable, catvar2.

data Have;
 input ID callDate :ddmmyy. employment :$15. catvar2 :$1.;
 format callDate ddmmyy10.;
 datalines;
1 15/02/2021 full-time A
1 17/02/2021 full-time B
1 19/02/2021 full-time B
2 10/02/2021 part-time A
2 15/02/2021 part-time C
3 17/02/2021 full-time C
3 19/02/2021 full-time A
3 20/02/2021 full-time A
4 08/02/2021 full-time B
4 10/02/2021 full-time A
4 13/02/2021 full-time A
4 15/02/2021 full-time C
4 17/02/2021 full-time A
4 19/02/2021 full-time B
5 10/02/2021 unemployed B
;

To get more compact PROC SQL code you can put the repetitive part of the SELECT statement into a macro:

%macro mode(var);
(select max(x) from
  (select x from
    (select &var as x, count(*) as c
     from have b
     where b.calldate=a.calldate
     group by x)
   having c=max(c))) as mode_&var
%mend mode;

Then you just need a short macro call per categorical variable in the PROC SQL step:

proc sql;
create table daily_stats as
select calldate
, count(distinct id) as id_number
, %mode(employment)
, %mode(catvar2)
from have a
group by calldate;
quit;

Result:

                                    mode_        mode_
Obs     callDate     id_number    employment    catvar2

 1     08/02/2021        1        full-time        B
 2     10/02/2021        3        unemployed       A
 3     13/02/2021        1        full-time        A
 4     15/02/2021        3        full-time        C
 5     17/02/2021        3        full-time        C
 6     19/02/2021        3        full-time        B
 7     20/02/2021        1        full-time        A

This works for character and numeric categorical variables.

Mode of nominal variables