Re: Mode computation for different variables in SAS

Heiko_Can · Posted 06-26-2016 10:16 AM

I need to compute the mode for different variables for the same id using SAS. The original dataset is

    data df_a;
    input id 	obs 	e1 	e2 	e3 	b0 	b1 	b2 	b3 	b4 	b5;
    datalines;
    2 	2 	1 	2 	3 	600 	0 	200 	0 	530 	550
    2 	1 	1 	2 	3 	600 	0 	200 	0 	530 	550
    3 	3 	2 	2 	3 	150 	500 	. 	120 	3 	58
    3 	2 	2 	2 	3 	150 	500 	. 	120 	3 	58
    3 	1 	2 	2 	1 	150 	500 	. 	120 	3 	58
    4 	3 	1 	2 	3 	200 	300 	550 	270 	0 	50
    4 	2 	1 	2 	3 	200 	300 	550 	270 	0 	50
    4 	1 	1 	2 	3 	200 	300 	550 	270 	0 	50
    5 	1 	1 	2 	. 	20 	20 	200 	100 	500 	57
    5 	2 	1 	2 	. 	20 	20 	200 	100 	500 	57
    5 	3 	1 	2 	. 	20 	20 	200 	100 	500 	57
    run;

The desired output is

    id	obs	e1	e2	e3	mode_e	        b0	b1	b2	b3	b4	b5	mode_b
    2	2	1	2	3	random choice	600	0	200	0	530	550	0
    2	1	1	2	3	random choice	600	0	200	0	530	550	0
    3	3	2	2	3	2	            150	500	. 	120	3	58	random choice
    3	2	2	2	3	2	            150	500	. 	120	3	58	random choice
    3	1	2	2	1	2	            150	500	. 	120	3	58	random choice
    4	3	1	2	3	random choice	200	300	550	270	0	50	random choice
    4	2	1	2	3	random choice	200	300	550	270	0	50	random choice
    4	1	1	2	3	random choice	200	300	550	270	0	50	random choice
    5	1	1	2	. 	random choice	20	20	200	100	500	57	20
    5	2	1	2	. 	random choice	20	20	200	100	500	57	20
    5	3	1	2	. 	random choice	20	20	200	100	500	57	20

Points to consider
i) I need the mode for variables e1-e3 and b0-b5 in a new column (mode_e, mode_b).
iii) If multimode, select one randomly.
iv) keep all the variables in the final output.
v) The code must be able to compute this for large database

I have tried to use the code proposed here https://communities.sas.com/t5/Base-SAS-Programming/Calculating-Mode-across-variables/td-p/150375 but it does not give the desired output.

Reeza · Posted 06-26-2016 10:25 AM

Can we assume b/e will stay the same for each record of each ID?

How did that solution not work?

Heiko_Can · Posted 06-26-2016 10:53 AM

Hi Reeza,

Yes, the mode for the variables must be the same for each id. For example for id=2 and the variables e1 to e3 the mode must be chosen randomly for this id but it must be the same for both observations.

id  obs e1  e2  e3  mode_e  b0  b1  b2  b3  b4  b5  mode_b
2   2   1   2   3   2       600 0   200 0   530 550 0
2   1   1   2   3   2       600 0   200 0   530 550 0

The linked solution returns the minimum value rather than a random choice.

Reeza · Posted 06-26-2016 11:04 AM

The last solution returns minimum. The green/correct one returns a random one.

rogerjdeangelis · Posted 06-27-2016 03:46 PM

Computing the mode by row (combination of R and R)

HAVE

Need the mode for variables e1-e3 and  b0-b5 in a new column (mode_e, mode_b).
If multimode, select one randomly.
If unique select select one randomly.
Keep all the variables in the final output.
The code must be able to compute this for large database.

The R code below can process about 100,000 records a minute
on my 2008 T7400(64gb ram) running 8 jobs in parallel.
10 minutes to do 1 million, 1000 minutes to 100 millon (still small scale).
Too long to get to big data (ie 10 billion +).
10 billion would take 100,000 minutes. (100000/1440) = 69 days.
SAS should be faster, but not as elgant(maybe IML)?.

The solutiion below will not scale to big data (ie 10 billion+ rows)
Maybe it will work work with millions of rows

* added a key - I like I primary key;
data "d:/sd1/df_a.sas7bdat" ;
 retain id obs key 0;
 input id obs e1 e2 e3 b0 b1 b2 b3 b4 b5;
 key=_n_;
 datalines;
 2 2 1 2 3 600 0 200 0 530 550
 2 1 1 2 3 600 0 200 0 530 550
 3 3 2 2 3 150 500 . 120 3 58
 3 2 2 2 3 150 500 . 120 3 58
 3 1 2 2 1 150 500 . 120 3 58
 4 3 1 2 3 200 300 550 270 0 50
 4 2 1 2 3 200 300 550 270 0 50
 4 1 1 2 3 200 300 550 270 0 50
 5 1 1 2 . 20 20 200 200 500 57
 5 2 1 2 . 20 20 200 200 500 57
 5 3 1 2 . 20 20 200 200 500 57
 run;quit;

HAVE

/*
Up to 40 obs WC000001.DF_A total obs=11

 ID    OBS    KEY    E1    E2    E3     B0     B1     B2     B3     B4     B5

  2     2       1     1     2     3    600      0    200      0    530    550
  2     1       2     1     2     3    600      0    200      0    530    550
  3     3       3     2     2     3    150    500      .    120      3     58
  3     2       4     2     2     3    150    500      .    120      3     58
  3     1       5     2     2     1    150    500      .    120      3     58
  4     3       6     1     2     3    200    300    550    270      0     50
  4     2       7     1     2     3    200    300    550    270      0     50
  4     1       8     1     2     3    200    300    550    270      0     50
  5     1       9     1     2     .     20     20    200    100    500     57
  5     2      10     1     2     .     20     20    200    100    500     57
  5     3      11     1     2     .     20     20    200    100    500     57

WANT

Up to 40 obs from xpt.keyres total obs=&tob

ID    OBS    KEY    E1    E2    E3     B0     B1     B2     B3     B4     B5    MODE_E    MODE_B

 2     2       1     1     2     3    600      0    200      0    530    550       1          0  single mode
 2     1       2     1     2     3    600      0    200      0    530    550       2          0  single mode
 3     3       3     2     2     3    150    500      .    120      3     58       3          3  random
 3     2       4     2     2     3    150    500      .    120      3     58       2        500  random
 3     1       5     2     2     1    150    500      .    120      3     58       2          3  random
 4     3       6     1     2     3    200    300    550    270      0     50       6        300  random
 4     2       7     1     2     3    200    300    550    270      0     50       7        270  random
 4     1       8     1     2     3    200    300    550    270      0     50       2        270  random
 5     1       9     1     2     .     20     20    200    200    500     57       2         20  multimode(random)
 5     2      10     1     2     .     20     20    200    200    500     57       2         200 multimode(random)
 5     3      11     1     2     .     20     20    200    200    500     57       2         20  multimode(random)

*/

SOLUTION
/*
Using list lapply find max count of duplicates
Get the values corresponding to max counts
Sample random one of multiple modes
Sample random one if no mode (all values distinct)
Sample has an issue when there is only one valee so I use resamp
*/

%utl_submit_r(
    library(haven);
    library(SASxport);
    datcut<-as.matrix(read_sas('d:/sd1/df_a.sas7bdat'));
    /* Begin R solution */
    resamp <- function(x){if (length(x)==1) {x} else {sample(x,1)}};
    rmode <- function(x) {
      x <- sort(x);
      u <- unique(x);
      if (length(x) == length(u)) {resamp(x)} else {
         y <- lapply(u, function(y) length(x[x==y]));
         resamp(u[which( unlist(y) == max(unlist(y)))]);
      }
    };
    mode_e<-apply(datcut[,c(4:6)],1,rmode);
    mode_b<-apply(datcut[,c(7:11)],1,rmode);
    keyres=as.data.frame(cbind(datcut,as.data.frame(cbind(mode_e,mode_b))));
    keyres;
    write.xport(keyres,file='d:/xpt/xfr&set..xpt');
));

/* R LOG
> library(haven); library(SASxport); datcut<-as.matrix(read_sas('d:/sd1/df_a.sas7bdat'));
resamp <- function(x){if (length(x)==1) {x} else {sample(x,1)}}; rmode <- functi
on(x) {   x <- sort(x);   u <- unique(x);   y <- lapply(u, function(y) length(x[x==y]));
u[which( unlist(y) == max(unlist(y)))]; }; mode_e<-unlist(lapply(apply(datcut[,c
(4:6)],1,rmode),resamp)); mode_b<-unlist(l

   ID OBS KEY E1 E2 E3  B0  B1  B2  B3  B4  B5 mode_e mode_b
1   2   2   1  1  2  3 600   0 200   0 530 550      2      0
2   2   1   2  1  2  3 600   0 200   0 530 550      1      0

3   3   3   3  2  2  3 150 500  NA 120   3  58      2    150 Random
4   3   2   4  2  2  3 150 500  NA 120   3  58      2    120 Random
5   3   1   5  2  2  1 150 500  NA 120   3  58      2    500 Random

6   4   3   6  1  2  3 200 300 550 270   0  50      2      0 Random
7   4   2   7  1  2  3 200 300 550 270   0  50      1    270 Random
8   4   1   8  1  2  3 200 300 550 270   0  50      1    270 Random

9   5   1   9  1  2 NA  20  20 200 200 500  57      1     200 multimode
10  5   2  10  1  2 NA  20  20 200 200 500  57      2     20  multimode
11  5   3  11  1  2 NA  20  20 200 200 500  57      2     200 multimode
*/


libname xpt xport 'd:/xpt/xfr.xpt';
proc print data=xpt.keyres;
;run;quit;
libname xpt clear;

/*
SAS Proc print
ID    OBS    KEY    E1    E2    E3     B0     B1     B2     B3     B4     B5    MODE_E    MODE_B

  2     2       1     1     2     3    600      0    200      0    530    550       2          0
  2     1       2     1     2     3    600      0    200      0    530    550       1          0
  3     3       3     2     2     3    150    500      .    120      3     58       2        150
  3     2       4     2     2     3    150    500      .    120      3     58       2        120
  3     1       5     2     2     1    150    500      .    120      3     58       2        500
  4     3       6     1     2     3    200    300    550    270      0     50       2          0
  4     2       7     1     2     3    200    300    550    270      0     50       1        270
  4     1       8     1     2     3    200    300    550    270      0     50       1        270
  5     1       9     1     2     .     20     20    200    100    500     57       1        200
  5     2      10     1     2     .     20     20    200    100    500     57       2         20
  5     3      11     1     2     .     20     20    200    100    500     57       2        200
*/


SCALE THE PROBLEM

* This solution only scales well for small data, in the millions;
* Create a very small dataset of 400,000 obs;

* We will run 8 parallel jobs so lets
  split the 400,000  observations
  into 8 pieces of 50,000 obs each;

* Create a small datasets about 400,000 obs;

* 400,000;
data "d:/wrk/datcut.sas7bdat";
   retain key 0;
   set "d:/sd1/df_a.sas7bdat"(drop=key obs=10);
   do i=1 to 40000;
      key=key+1;
      output;
   end;
   drop i;
;run;quit;

/*
 Up to 40 obs "d:/wrk/datcut.sas7bdat" total obs=400,000

 KEY    ID    OBS    E1    E2    E3     B0    B1     B2    B3     B4     B5

   1     2     2      1     2     3    600     0    200     0    530    550
   2     2     2      1     2     3    600     0    200     0    530    550
   3     2     2      1     2     3    600     0    200     0    530    550
   4     2     2      1     2     3    600     0    200     0    530    550
   5     2     2      1     2     3    600     0    200     0    530    550
   6     2     2      1     2     3    600     0    200     0    530    550
   7     2     2      1     2     3    600     0    200     0    530    550
   8     2     2      1     2     3    600     0    200     0    530    550
*/

%macro getmode(set,fyl=d:/wrk/datcut.sas7bdat,span=40000);

    /*
      %let set=5;
      %let fyl=d:/wrk/datcut.sas7bdat;
      %let span=5000;
    */

    %local b e;

    %let e=%sysevalf(&set * &span );
    %let b=%sysevalf((&set - 1)*&span +1);

    %put Stary  &=b;
    %put END    &=e;

    * grab 50,000 subset;
    data "d:/wrk/subset&set..sas7bdat";
       set "&fyl"(firstobs=&b obs=&e);
    run;quit;

    %utl_submit_r(
    library(haven);
    library(SASxport);
    datcut<-as.matrix(read_sas('d:/wrk/subset&set..sas7bdat'));
    /* Begin R solution */
    resamp <- function(x){if (length(x)==1) {x} else {sample(x,1)}};
    rmode <- function(x) {
      x <- sort(x);
      u <- unique(x);
      if (length(x) == length(u)) {resamp(x)} else {
         y <- lapply(u, function(y) length(x[x==y]));
         resamp(u[which( unlist(y) == max(unlist(y)))]);
      }
    };
    mode_e<-apply(datcut[,c(4:6)],1,rmode);
    mode_b<-apply(datcut[,c(7:12)],1,rmode);
    keyres=as.data.frame(cbind(datcut,as.data.frame(cbind(mode_e,mode_b))));
    write.xport(keyres,file='d:/xpt/xfr&set..xpt');
    );


%mend getmode;

* if you tested with let statements inside macro (creates globals);
%symdel set b e fyl;

%getmode(3);

libname xpt xport 'd:/xpt/xfr3.xpt';
proc print data=xpt.keyres(obs=10);
;run;quit;
libname xpt clear;



%let _s=%sysfunc(compbl(C:\Progra~1\SASHome\SASFoundation\9.4\sas.exe -sysin c:\nul -sasautos c:\oto -autoexec c:\oto\Tut_Oto.sas
-work h:\wrk));


options noxwait noxsync;
%let tym=%sysfunc(time());
systask kill sys1 sys2 sys3 sys4  sys5 sys6 sys7 sys8;
systask command "&_s -termstmt %nrstr(%getmode(1);) -log G:\wrk\sys1.log" taskname=sys1;
systask command "&_s -termstmt %nrstr(%getmode(2);) -log G:\wrk\sys2.log" taskname=sys2;
systask command "&_s -termstmt %nrstr(%getmode(3);) -log G:\wrk\sys3.log" taskname=sys3;
systask command "&_s -termstmt %nrstr(%getmode(4);) -log G:\wrk\sys4.log" taskname=sys4;
systask command "&_s -termstmt %nrstr(%getmode(5);) -log G:\wrk\sys5.log" taskname=sys5;
systask command "&_s -termstmt %nrstr(%getmode(6);) -log G:\wrk\sys6.log" taskname=sys6;
systask command "&_s -termstmt %nrstr(%getmode(7);) -log G:\wrk\sys7.log" taskname=sys7;
systask command "&_s -termstmt %nrstr(%getmode(8);) -log G:\wrk\sys8.log" taskname=sys8;
waitfor sys1 sys2 sys3 sys4  sys5 sys6 sys7 sys8;
%put %sysevalf( %sysfunc(time()) - &tym);


/*
NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
NOTE: The SAS System used:
      real time           4:15.10


                                           Cumulative    Cumulative
E1    E2    E3    Frequency     Percent     Frequency      Percent
-------------------------------------------------------------------
 1     2     .       80000       20.00         80000        20.00
 1     2     3      200000       50.00        280000        70.00
 2     2     1       40000       10.00        320000        80.00
 2     2     3       80000       20.00        400000       100.00

*/


%macro vueapp(dummy);

  %local i;
  %do i=1 %to 8;
    libname xpt&i xport "d:/xpt/xfr&i..xpt";
  %end;

  libname sd1 "d:/sd1";

  proc datasets lib=sd1 mt=view;
  delete keyresall;
  ;run;quit;

  data sd1.keyresall/view=sd1.keyresall;
    set
       xpt1.keyres
       xpt2.keyres
       xpt3.keyres
       xpt4.keyres
       xpt5.keyres
       xpt6.keyres
       xpt7.keyres
       xpt8.keyres
    ;
  ;run;quit;

  proc freq data=sd1.keyresall;
    tables e1*e2*e3 / list missing;
  ;run;quit;

%mend vueapp;

%vueapp;


libname xpt xport 'd:/xpt/xfr8.xpt';
proc print data=sd1.keyresall(where=(uniform(-1)<10/400000));
;run;quit;
libname xpt clear;

 /*
    KEY     ID    OBS    E1    E2    E3     B0     B1     B2     B3     B4     B5    MODE_E    MODE_B

   12384     2     2      1     2     3    600      0    200      0    530    550       2          0
   50429     2     1      1     2     3    600      0    200      0    530    550       2          0
   54631     2     1      1     2     3    600      0    200      0    530    550       1          0
   77263     2     1      1     2     3    600      0    200      0    530    550       2          0
   77572     2     1      1     2     3    600      0    200      0    530    550       1          0
  130353     3     2      2     2     3    150    500      .    120      3     58       2          3
  233744     4     3      1     2     3    200    300    550    270      0     50       2        550
  319192     4     1      1     2     3    200    300    550    270      0     50       1        200
*/

rogerjdeangelis · Posted 06-27-2016 07:10 PM

Computing the mode by row (combination of R and R)

HAVE

Need the mode for variables e1-e3 and  b0-b5 in a new column (mode_e, mode_b).
If multimode, select one randomly.
If unique select select one randomly.
Keep all the variables in the final output.
The code must be able to compute this for large database.

The R code below can process about 100,000 records a minute
on my 2008 T7400(64gb ram) running 8 jobs in parallel.
10 minutes to do 1 million, 1000 minutes to 100 millon (still small scale).
Too long to get to big data (ie 10 billion +).
10 billion would take 100,000 minutes. (100000/1440) = 69 days.
SAS should be faster, but not as elgant(maybe IML)?.

The solutiion below will not scale to big data (ie 10 billion+ rows)
Maybe it will work work with millions of rows

* added a key - I like I primary key;
data "d:/sd1/df_a.sas7bdat" ;
 retain id obs key 0;
 input id obs e1 e2 e3 b0 b1 b2 b3 b4 b5;
 key=_n_;
 datalines;
 2 2 1 2 3 600 0 200 0 530 550
 2 1 1 2 3 600 0 200 0 530 550
 3 3 2 2 3 150 500 . 120 3 58
 3 2 2 2 3 150 500 . 120 3 58
 3 1 2 2 1 150 500 . 120 3 58
 4 3 1 2 3 200 300 550 270 0 50
 4 2 1 2 3 200 300 550 270 0 50
 4 1 1 2 3 200 300 550 270 0 50
 5 1 1 2 . 20 20 200 200 500 57
 5 2 1 2 . 20 20 200 200 500 57
 5 3 1 2 . 20 20 200 200 500 57
 run;quit;

HAVE

/*
Up to 40 obs WC000001.DF_A total obs=11

 ID    OBS    KEY    E1    E2    E3     B0     B1     B2     B3     B4     B5

  2     2       1     1     2     3    600      0    200      0    530    550
  2     1       2     1     2     3    600      0    200      0    530    550
  3     3       3     2     2     3    150    500      .    120      3     58
  3     2       4     2     2     3    150    500      .    120      3     58
  3     1       5     2     2     1    150    500      .    120      3     58
  4     3       6     1     2     3    200    300    550    270      0     50
  4     2       7     1     2     3    200    300    550    270      0     50
  4     1       8     1     2     3    200    300    550    270      0     50
  5     1       9     1     2     .     20     20    200    100    500     57
  5     2      10     1     2     .     20     20    200    100    500     57
  5     3      11     1     2     .     20     20    200    100    500     57

WANT

Up to 40 obs from xpt.keyres total obs=&tob

ID    OBS    KEY    E1    E2    E3     B0     B1     B2     B3     B4     B5    MODE_E    MODE_B

 2     2       1     1     2     3    600      0    200      0    530    550       1          0  single mode
 2     1       2     1     2     3    600      0    200      0    530    550       2          0  single mode
 3     3       3     2     2     3    150    500      .    120      3     58       3          3  random
 3     2       4     2     2     3    150    500      .    120      3     58       2        500  random
 3     1       5     2     2     1    150    500      .    120      3     58       2          3  random
 4     3       6     1     2     3    200    300    550    270      0     50       6        300  random
 4     2       7     1     2     3    200    300    550    270      0     50       7        270  random
 4     1       8     1     2     3    200    300    550    270      0     50       2        270  random
 5     1       9     1     2     .     20     20    200    200    500     57       2         20  multimode(random)
 5     2      10     1     2     .     20     20    200    200    500     57       2         200 multimode(random)
 5     3      11     1     2     .     20     20    200    200    500     57       2         20  multimode(random)

*/

SOLUTION
/*
Using list lapply find max count of duplicates
Get the values corresponding to max counts
Sample random one of multiple modes
Sample random one if no mode (all values distinct)
Sample has an issue when there is only one valee so I use resamp
*/

%utl_submit_r(
    library(haven);
    library(SASxport);
    datcut<-as.matrix(read_sas('d:/sd1/df_a.sas7bdat'));
    /* Begin R solution */
    resamp <- function(x){if (length(x)==1) {x} else {sample(x,1)}};
    rmode <- function(x) {
      x <- sort(x);
      u <- unique(x);
      if (length(x) == length(u)) {resamp(x)} else {
         y <- lapply(u, function(y) length(x[x==y]));
         resamp(u[which( unlist(y) == max(unlist(y)))]);
      }
    };
    mode_e<-apply(datcut[,c(4:6)],1,rmode);
    mode_b<-apply(datcut[,c(7:11)],1,rmode);
    keyres=as.data.frame(cbind(datcut,as.data.frame(cbind(mode_e,mode_b))));
    keyres;
    write.xport(keyres,file='d:/xpt/xfr&set..xpt');
));

/* R LOG
> library(haven); library(SASxport); datcut<-as.matrix(read_sas('d:/sd1/df_a.sas7bdat'));
resamp <- function(x){if (length(x)==1) {x} else {sample(x,1)}}; rmode <- functi
on(x) {   x <- sort(x);   u <- unique(x);   y <- lapply(u, function(y) length(x[x==y]));
u[which( unlist(y) == max(unlist(y)))]; }; mode_e<-unlist(lapply(apply(datcut[,c
(4:6)],1,rmode),resamp)); mode_b<-unlist(l

   ID OBS KEY E1 E2 E3  B0  B1  B2  B3  B4  B5 mode_e mode_b
1   2   2   1  1  2  3 600   0 200   0 530 550      2      0
2   2   1   2  1  2  3 600   0 200   0 530 550      1      0

3   3   3   3  2  2  3 150 500  NA 120   3  58      2    150 Random
4   3   2   4  2  2  3 150 500  NA 120   3  58      2    120 Random
5   3   1   5  2  2  1 150 500  NA 120   3  58      2    500 Random

6   4   3   6  1  2  3 200 300 550 270   0  50      2      0 Random
7   4   2   7  1  2  3 200 300 550 270   0  50      1    270 Random
8   4   1   8  1  2  3 200 300 550 270   0  50      1    270 Random

9   5   1   9  1  2 NA  20  20 200 200 500  57      1     200 multimode
10  5   2  10  1  2 NA  20  20 200 200 500  57      2     20  multimode
11  5   3  11  1  2 NA  20  20 200 200 500  57      2     200 multimode
*/


libname xpt xport 'd:/xpt/xfr.xpt';
proc print data=xpt.keyres;
;run;quit;
libname xpt clear;

/*
SAS Proc print
ID    OBS    KEY    E1    E2    E3     B0     B1     B2     B3     B4     B5    MODE_E    MODE_B

  2     2       1     1     2     3    600      0    200      0    530    550       2          0
  2     1       2     1     2     3    600      0    200      0    530    550       1          0
  3     3       3     2     2     3    150    500      .    120      3     58       2        150
  3     2       4     2     2     3    150    500      .    120      3     58       2        120
  3     1       5     2     2     1    150    500      .    120      3     58       2        500
  4     3       6     1     2     3    200    300    550    270      0     50       2          0
  4     2       7     1     2     3    200    300    550    270      0     50       1        270
  4     1       8     1     2     3    200    300    550    270      0     50       1        270
  5     1       9     1     2     .     20     20    200    100    500     57       1        200
  5     2      10     1     2     .     20     20    200    100    500     57       2         20
  5     3      11     1     2     .     20     20    200    100    500     57       2        200
*/


SCALE THE PROBLEM

* This solution only scales well for small data, in the millions;
* Create a very small dataset of 400,000 obs;

* We will run 8 parallel jobs so lets
  split the 400,000  observations
  into 8 pieces of 50,000 obs each;

* Create a small datasets about 400,000 obs;

* 400,000;
data "d:/wrk/datcut.sas7bdat";
   retain key 0;
   set "d:/sd1/df_a.sas7bdat"(drop=key obs=10);
   do i=1 to 40000;
      key=key+1;
      output;
   end;
   drop i;
;run;quit;

/*
 Up to 40 obs "d:/wrk/datcut.sas7bdat" total obs=400,000

 KEY    ID    OBS    E1    E2    E3     B0    B1     B2    B3     B4     B5

   1     2     2      1     2     3    600     0    200     0    530    550
   2     2     2      1     2     3    600     0    200     0    530    550
   3     2     2      1     2     3    600     0    200     0    530    550
   4     2     2      1     2     3    600     0    200     0    530    550
   5     2     2      1     2     3    600     0    200     0    530    550
   6     2     2      1     2     3    600     0    200     0    530    550
   7     2     2      1     2     3    600     0    200     0    530    550
   8     2     2      1     2     3    600     0    200     0    530    550
*/

%macro getmode(set,fyl=d:/wrk/datcut.sas7bdat,span=40000);

    /*
      %let set=5;
      %let fyl=d:/wrk/datcut.sas7bdat;
      %let span=5000;
    */

    %local b e;

    %let e=%sysevalf(&set * &span );
    %let b=%sysevalf((&set - 1)*&span +1);

    %put Stary  &=b;
    %put END    &=e;

    * grab 50,000 subset;
    data "d:/wrk/subset&set..sas7bdat";
       set "&fyl"(firstobs=&b obs=&e);
    run;quit;

    %utl_submit_r(
    library(haven);
    library(SASxport);
    datcut<-as.matrix(read_sas('d:/wrk/subset&set..sas7bdat'));
    /* Begin R solution */
    resamp <- function(x){if (length(x)==1) {x} else {sample(x,1)}};
    rmode <- function(x) {
      x <- sort(x);
      u <- unique(x);
      if (length(x) == length(u)) {resamp(x)} else {
         y <- lapply(u, function(y) length(x[x==y]));
         resamp(u[which( unlist(y) == max(unlist(y)))]);
      }
    };
    mode_e<-apply(datcut[,c(4:6)],1,rmode);
    mode_b<-apply(datcut[,c(7:12)],1,rmode);
    keyres=as.data.frame(cbind(datcut,as.data.frame(cbind(mode_e,mode_b))));
    write.xport(keyres,file='d:/xpt/xfr&set..xpt');
    );


%mend getmode;

* if you tested with let statements inside macro (creates globals);
%symdel set b e fyl;

%getmode(3);

libname xpt xport 'd:/xpt/xfr3.xpt';
proc print data=xpt.keyres(obs=10);
;run;quit;
libname xpt clear;



%let _s=%sysfunc(compbl(C:\Progra~1\SASHome\SASFoundation\9.4\sas.exe -sysin c:\nul -sasautos c:\oto -autoexec c:\oto\Tut_Oto.sas
-work h:\wrk));


options noxwait noxsync;
%let tym=%sysfunc(time());
systask kill sys1 sys2 sys3 sys4  sys5 sys6 sys7 sys8;
systask command "&_s -termstmt %nrstr(%getmode(1);) -log G:\wrk\sys1.log" taskname=sys1;
systask command "&_s -termstmt %nrstr(%getmode(2);) -log G:\wrk\sys2.log" taskname=sys2;
systask command "&_s -termstmt %nrstr(%getmode(3);) -log G:\wrk\sys3.log" taskname=sys3;
systask command "&_s -termstmt %nrstr(%getmode(4);) -log G:\wrk\sys4.log" taskname=sys4;
systask command "&_s -termstmt %nrstr(%getmode(5);) -log G:\wrk\sys5.log" taskname=sys5;
systask command "&_s -termstmt %nrstr(%getmode(6);) -log G:\wrk\sys6.log" taskname=sys6;
systask command "&_s -termstmt %nrstr(%getmode(7);) -log G:\wrk\sys7.log" taskname=sys7;
systask command "&_s -termstmt %nrstr(%getmode(8);) -log G:\wrk\sys8.log" taskname=sys8;
waitfor sys1 sys2 sys3 sys4  sys5 sys6 sys7 sys8;
%put %sysevalf( %sysfunc(time()) - &tym);


/*
NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
NOTE: The SAS System used:
      real time           4:15.10


                                           Cumulative    Cumulative
E1    E2    E3    Frequency     Percent     Frequency      Percent
-------------------------------------------------------------------
 1     2     .       80000       20.00         80000        20.00
 1     2     3      200000       50.00        280000        70.00
 2     2     1       40000       10.00        320000        80.00
 2     2     3       80000       20.00        400000       100.00

*/


%macro vueapp(dummy);

  %local i;
  %do i=1 %to 8;
    libname xpt&i xport "d:/xpt/xfr&i..xpt";
  %end;

  libname sd1 "d:/sd1";

  proc datasets lib=sd1 mt=view;
  delete keyresall;
  ;run;quit;

  data sd1.keyresall/view=sd1.keyresall;
    set
       xpt1.keyres
       xpt2.keyres
       xpt3.keyres
       xpt4.keyres
       xpt5.keyres
       xpt6.keyres
       xpt7.keyres
       xpt8.keyres
    ;
  ;run;quit;

  proc freq data=sd1.keyresall;
    tables e1*e2*e3 / list missing;
  ;run;quit;

%mend vueapp;

%vueapp;


libname xpt xport 'd:/xpt/xfr8.xpt';
proc print data=sd1.keyresall(where=(uniform(-1)<10/400000));
;run;quit;
libname xpt clear;

 /*
    KEY     ID    OBS    E1    E2    E3     B0     B1     B2     B3     B4     B5    MODE_E    MODE_B

   12384     2     2      1     2     3    600      0    200      0    530    550       2          0
   50429     2     1      1     2     3    600      0    200      0    530    550       2          0
   54631     2     1      1     2     3    600      0    200      0    530    550       1          0
   77263     2     1      1     2     3    600      0    200      0    530    550       2          0
   77572     2     1      1     2     3    600      0    200      0    530    550       1          0
  130353     3     2      2     2     3    150    500      .    120      3     58       2          3
  233744     4     3      1     2     3    200    300    550    270      0     50       2        550
  319192     4     1      1     2     3    200    300    550    270      0     50       1        200
*/

Ksharp · Posted 06-26-2016 09:05 PM

You have two category of variables, so you need run the code two times.

data have;
input id 	obs 	e1 	e2 	e3 	b0 	b1 	b2 	b3 	b4 	b5;
datalines;
    2 	2 	1 	2 	3 	600 	0 	200 	0 	530 	550
    2 	1 	1 	2 	3 	600 	0 	200 	0 	530 	550
    3 	3 	2 	2 	3 	150 	500 	. 	120 	3 	58
    3 	2 	2 	2 	3 	150 	500 	. 	120 	3 	58
    3 	1 	2 	2 	1 	150 	500 	. 	120 	3 	58
    4 	3 	1 	2 	3 	200 	300 	550 	270 	0 	50
    4 	2 	1 	2 	3 	200 	300 	550 	270 	0 	50
    4 	1 	1 	2 	3 	200 	300 	550 	270 	0 	50
    5 	1 	1 	2 	. 	20 	20 	200 	100 	500 	57
    5 	2 	1 	2 	. 	20 	20 	200 	100 	500 	57
    5 	3 	1 	2 	. 	20 	20 	200 	100 	500 	57
;
run;


data want(drop=k count i max ran);
if _n_ eq 1 then do;
declare hash ha();
declare hiter hi('ha');
  ha.definekey('k');
  ha.definedata('k','count');
  ha.definedone();
declare hash random(ordered:'a');
declare hiter hir('random');
  random.definekey('ran');
  random.definedata('k','count');
  random.definedone();
end;
set have;




array x{*} e: ;
do i=1 to dim(x);
if not missing(x{i}) then do;
     k=x{i};
     if ha.find()=0 then do;count+1;ha.replace();end;
       else do;count=1;ha.add();end;
end;
end;
do while(hi.next()=0);
ran=ranuni(-1);random.add();
end;
max=0;
do while(hir.next()=0);
if count gt max then do;max=count;e_mod=k;end;
end;
ha.clear();     random.clear();






array y{*} b: ;
do i=1 to dim(y);
if not missing(y{i}) then do;
     k=y{i};
     if ha.find()=0 then do;count+1;ha.replace();end;
       else do;count=1;ha.add();end;
end;
end;
do while(hi.next()=0);
ran=ranuni(-1);random.add();
end;
max=0;
do while(hir.next()=0);
if count gt max then do;max=count;b_mod=k;end;
end;
ha.clear();     random.clear();

run;

rogerjdeangelis · Posted 06-27-2016 04:00 PM

Hi Team

There was atype in my previous heading, should have been R and SAS not R and R.

I ran Xia's code and it took about 6 seconds for 400,000 records. Much faster than R.

Mine took 4 minutes, running in parallel.

The output was the same as mine.

The FREQ Procedure

                                           Cumulative    Cumulative
E1    E2    E3    Frequency     Percent     Frequency      Percent
-------------------------------------------------------------------
 1     2     .       80000       20.00         80000        20.00
 1     2     3      200000       50.00        280000        70.00
 2     2     1       40000       10.00        320000        80.00
 2     2     3       80000       20.00        400000       100.00

rogerjdeangelis · Posted 06-27-2016 07:12 PM

Not sure why I am getting double posts

I will try to explain Xias astonishing fast
mode algoritm


Lets explain how we arrive at B_MOD

O B0    B1     B2     B3     B4     B5     B_MOD

1 20    20    200    200    300    300       20
2 20    20    200    200    300    300      200
3 20    20    200    200    300    300      300

OBSERVATION 1 B_MOD=20

* first we examine each  of the Bs against
  all other Bs, but add only when we exceed a
  previous count.

b iter  COUNT=1 K=20 20 20 200 200 300 300
b iter  COUNT=2 K=20 20 20 200 200 300 300

b iter  COUNT=1 K=200 20 20 200 200 300 300
b iter  COUNT=2 K=200 20 20 200 200 300 300

b iter  COUNT=1 K=300 20 20 200 200 300 300
b iter  COUNT=2 K=300 20 20 200 200 300 300

* final analysis we yhe max count of 2 for three pairs;

final b COUNT=2 K=300 20 20 200 200 300 300

* nest we generate a random sequence of uniform random numbers;

ran iter b  RAN=0.9850622509 K=200 COUNT=2
ran iter b  RAN=0.8866575662 K=300 COUNT=2
ran iter b  RAN=0.2885982167 K=20 COUNT=2

iter max count  K=20  MAX=2 B_MOD=20
iter max count  K=300 MAX=2 B_MOD=20
iter max count  K=200 MAX=2 B_MOD=20
find max count  K=200 MAX=2 B_MOD=20

* they all have max=2 so we take
  the one with the lowest random number
  Remember ran is ordered

  RAN=0.2885982167 K=20

  ie B_MOD=20

OBSERVATION 2 B_MOD=200

Same as above

b iter  COUNT=1 K=20 20 20 200 200 300 300
b iter  COUNT=2 K=20 20 20 200 200 300 300

b iter  COUNT=1 K=200 20 20 200 200 300 300
b iter  COUNT=2 K=200 20 20 200 200 300 300

b iter  COUNT=1 K=300 20 20 200 200 300 300
b iter  COUNT=2 K=300 20 20 200 200 300 300

final b COUNT=2 K=300 20 20 200 200 300 300


ran iter b  RAN=0.0071778353 K=200 COUNT=2
ran iter b  RAN=0.5806072809 K=300 COUNT=2
ran iter b  RAN=0.9682415142 K=20  COUNT=2

iter max count  K=200 MAX=2 B_MOD=200
iter max count  K=300 MAX=2 B_MOD=200
iter max count  K=20 MAX=2 B_MOD=200
find max count  K=20 MAX=2 B_MOD=200


* they all have max=2 so we take
  the one with the lowest random number
  Remember ran is ordered

  RAN=0.0071778353 K=200

  ie B_MOD=200

b iter  COUNT=1 K=20 20 20 200 200 300 300
b iter  COUNT=2 K=20 20 20 200 200 300 300

b iter  COUNT=1 K=200 20 20 200 200 300 300
b iter  COUNT=2 K=200 20 20 200 200 300 300

b iter  COUNT=1 K=300 20 20 200 200 300 300
b iter  COUNT=2 K=300 20 20 200 200 300 300

final b COUNT=2 K=300 20 20 200 200 300 300

ran iter b  RAN=0.8807552 K=200 COUNT=2
ran iter b  RAN=0.2498995183 K=300 COUNT=2
ran iter b  RAN=0.7680367258 K=20 COUNT=2

iter max count  K=300 MAX=2 B_MOD=300
iter max count  K=20 MAX=2 B_MOD=300
iter max count  K=200 MAX=2 B_MOD=300
find max count  K=200 MAX=2 B_MOD=300

* they all have max=2 so we take
  the one with the lowest random number
  Remember ran is ordered

  RAN=0.2498995183 K=300

  ie B_MOD=300


%macro repmode(dummy);

%let cmt=%str(*);

%do i=1 %to 3;

    data want(drop=k count i max ran);

        /* setup */
       if _n_ eq 1 then do;

          declare hash ha ();
          declare hiter hi('ha');
          ha.definekey    ('k');
          ha.definedata   ('k','count');
          ha.definedone   ();

          declare hash random(ordered:'a');
          declare hiter hir  ('random');
          random.definekey   ('ran');
          random.definedata  ('k','count');
          random.definedone  ();
        end;

       set have(obs=1);

       array x{*} e: ;

       do i=1 to dim(x);
         if not missing(x{i}) then do;
            k=x{i};
            if ha.find()=0 then do;
               count+1;
               ha.replace();
            end;
            else do;
               count=1;
               ha.add();
            end;
         end;
        &cmt put 'e iter  ' count= e1= e2= e3=;
       end;

      &cmt put 'final e ' count= e1= e2= e3=;

       /*
       As it iterates through the e array it
       checks for for duplicates and
       updates the counter count

       Input
       E1    E2    E3

        2     2     2

       e iter  COUNT=1 E1=2 E2=2 E3=2
       e iter  COUNT=2 E1=2 E2=2 E3=2
       e iter  COUNT=3 E1=2 E2=2 E3=2

       final e COUNT=3 E1=2 E2=2 E3=2
       */

       do while(hi.next()=0);
          ran=ranuni(-1);
          random.add();
         &cmt put 'ran ' ran=;
       end;

       /*
       ran RAN=0.5100509643
       */

       max=0;
       do while(hir.next()=0);
          if count gt max then do;
              max=count;
              e_mod=k;
          end;
       end;

      &cmt put 'find max count  ' max= e_mod=;

       /*
       we have the value and the max;
       find max count  MAX=3 E_MOD=2
       */

       ha.clear();
       random.clear();

       * mobe onto the b array;
       array y{*} b: ;
       do i=1 to dim(y);
            if not missing(y{i}) then do;
             k=y{i};
             if ha.find()=0 then do;
                count+1;
                ha.replace();
             end;
             else do;
               count=1;
               ha.add();
             end;
           end;
          put 'b iter  ' count= k= b0 b1 b2 b3 b4 b5;
       end;

      put 'final b ' count= k= b0 b1 b2 b3 b4 b5;

       /*

       INPUT

        B0     B1     B2     B3     B4     B5

        20     20    200    200    500     57

       b iter  COUNT=1 20 20 200 200 500 57
       b iter  COUNT=2 20 20 200 200 500 57  The two 20s

       b iter  COUNT=1 20 20 200 200 500 57
       b iter  COUNT=2 20 20 200 200 500 57  The two 200s

       b iter  COUNT=1 20 20 200 200 500 57
       b iter  COUNT=1 20 20 200 200 500 57

       final b COUNT=1 20 20 200 200 500 57
       */

       do while(hi.next()=0);
          ran=ranuni(-1);
          random.add();
          put 'ran iter b  ' ran= k= count=;
       end;

      put 'ran fin b  ' ran= k= count=;

       /*
       ran iter b  RAN=0.5936162847
       ran iter b  RAN=0.5504719776
       ran iter b  RAN=0.1298635263
       ran iter b  RAN=0.2945035129

       ran fin b  RAN=0.2945035129
       */

       max=0;
       do while(hir.next()=0);
         if count gt max then do;
             max=count;
             b_mod=k;
          end;
          put 'iter max count  ' k= max= b_mod= ran=;
       end;

       put 'find max count  ' k= max= b_mod= rsn=;

       ha.clear();
       random.clear();

    run;

    proc append data=want base=wants;
    ;run;quit;

%end;

%mend repmode;

proc datasets lib=work;
delete wants;
;run;quit;

%utlnopts;
%repmode;

Ksharp · Posted 06-28-2016 01:42 AM

Roger,

I want you test the following IML code if it would be faster than data step.

data have;
input id 	obs 	e1 	e2 	e3 	b0 	b1 	b2 	b3 	b4 	b5;
datalines;
    2 	2 	1 	2 	3 	600 	0 	200 	0 	530 	550
    2 	1 	1 	2 	3 	600 	0 	200 	0 	530 	550
    3 	3 	2 	2 	3 	150 	500 	. 	120 	3 	58
    3 	2 	2 	2 	3 	150 	500 	. 	120 	3 	58
    3 	1 	2 	2 	1 	150 	500 	. 	120 	3 	58
    4 	3 	1 	2 	3 	200 	300 	550 	270 	0 	50
    4 	2 	1 	2 	3 	200 	300 	550 	270 	0 	50
    4 	1 	1 	2 	3 	200 	300 	550 	270 	0 	50
    5 	1 	1 	2 	. 	20 	20 	200 	100 	500 	57
    5 	2 	1 	2 	. 	20 	20 	200 	100 	500 	57
    5 	3 	1 	2 	. 	20 	20 	200 	100 	500 	57
;
run;
proc iml;
use have nobs nobs;
read all var {e1 e2 e3} into x;
read all var {b0 b1 b2 b3 b4 b5} into y;
close;

mode_e=j(nobs,1,.);
do i=1 to nobs;
 temp=x[i,];
 call tabulate(levels,freq,temp);
 max_idx=loc(freq=freq[<>]);
 want_idx=sample(max_idx,1);
 mode_e[i]=levels[want_idx];
end;

mode_b=j(nobs,1,.);
do i=1 to nobs;
 temp=y[i,];
 call tabulate(levels,freq,temp);
 max_idx=loc(freq=freq[<>]);
 want_idx=sample(max_idx,1);
 mode_b[i]=levels[want_idx];
end;

create mode var {mode_e mode_b};
append;
close;
quit;

data want;
 merge have mode;
run;
proc print noobs;run;

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away