DATA Step, Macro, Functions and more

Mode computation for different variables in SAS

Reply
New Contributor
Posts: 2

Mode computation for different variables in SAS

I need to compute the mode for different variables for the same id using SAS.  The original dataset is

 

    data df_a;
    input id 	obs 	e1 	e2 	e3 	b0 	b1 	b2 	b3 	b4 	b5;
    datalines;
    2 	2 	1 	2 	3 	600 	0 	200 	0 	530 	550
    2 	1 	1 	2 	3 	600 	0 	200 	0 	530 	550
    3 	3 	2 	2 	3 	150 	500 	. 	120 	3 	58
    3 	2 	2 	2 	3 	150 	500 	. 	120 	3 	58
    3 	1 	2 	2 	1 	150 	500 	. 	120 	3 	58
    4 	3 	1 	2 	3 	200 	300 	550 	270 	0 	50
    4 	2 	1 	2 	3 	200 	300 	550 	270 	0 	50
    4 	1 	1 	2 	3 	200 	300 	550 	270 	0 	50
    5 	1 	1 	2 	. 	20 	20 	200 	100 	500 	57
    5 	2 	1 	2 	. 	20 	20 	200 	100 	500 	57
    5 	3 	1 	2 	. 	20 	20 	200 	100 	500 	57
    run;

The desired output is

 

    id	obs	e1	e2	e3	mode_e	        b0	b1	b2	b3	b4	b5	mode_b
    2	2	1	2	3	random choice	600	0	200	0	530	550	0
    2	1	1	2	3	random choice	600	0	200	0	530	550	0
    3	3	2	2	3	2	            150	500	. 	120	3	58	random choice
    3	2	2	2	3	2	            150	500	. 	120	3	58	random choice
    3	1	2	2	1	2	            150	500	. 	120	3	58	random choice
    4	3	1	2	3	random choice	200	300	550	270	0	50	random choice
    4	2	1	2	3	random choice	200	300	550	270	0	50	random choice
    4	1	1	2	3	random choice	200	300	550	270	0	50	random choice
    5	1	1	2	. 	random choice	20	20	200	100	500	57	20
    5	2	1	2	. 	random choice	20	20	200	100	500	57	20
    5	3	1	2	. 	random choice	20	20	200	100	500	57	20

Points to consider
i) I need the mode for variables e1-e3 and b0-b5 in a new column (mode_e, mode_b).
iii) If multimode, select one randomly.
iv) keep all the variables in the final output.
v)  The code must be able to compute this for large database

 

I have tried to use the code proposed here https://communities.sas.com/t5/Base-SAS-Programming/Calculating-Mode-across-variables/td-p/150375 but it does not give the desired output.

 

 

 

 

 

 

 

 

Super User
Posts: 17,750

Re: Mode computation for different variables in SAS

Can we assume b/e will stay the same for each record of each ID?

How did that solution not work?

New Contributor
Posts: 2

Re: Mode computation for different variables in SAS

Hi Reeza,

 

Yes, the mode for the variables must be the same for each id. For example for id=2 and the variables e1 to e3 the mode must be chosen randomly for this id but it must be the same for both observations.

 

id  obs e1  e2  e3  mode_e  b0  b1  b2  b3  b4  b5  mode_b
2   2   1   2   3   2       600 0   200 0   530 550 0
2   1   1   2   3   2       600 0   200 0   530 550 0

The linked solution returns the minimum value rather than a random choice.

 

Super User
Posts: 17,750

Re: Mode computation for different variables in SAS

The last solution returns minimum. The green/correct one returns a random one. 

Valued Guide
Posts: 505

Re: Mode computation for different variables in SAS

Computing the mode by row (combination of R and R)

HAVE

Need the mode for variables e1-e3 and  b0-b5 in a new column (mode_e, mode_b).
If multimode, select one randomly.
If unique select select one randomly.
Keep all the variables in the final output.
The code must be able to compute this for large database.

The R code below can process about 100,000 records a minute
on my 2008 T7400(64gb ram) running 8 jobs in parallel.
10 minutes to do 1 million, 1000 minutes to 100 millon (still small scale).
Too long to get to big data (ie 10 billion +).
10 billion would take 100,000 minutes. (100000/1440) = 69 days.
SAS should be faster, but not as elgant(maybe IML)?.

The solutiion below will not scale to big data (ie 10 billion+ rows)
Maybe it will work work with millions of rows

* added a key - I like I primary key;
data "d:/sd1/df_a.sas7bdat" ;
 retain id obs key 0;
 input id obs e1 e2 e3 b0 b1 b2 b3 b4 b5;
 key=_n_;
 datalines;
 2 2 1 2 3 600 0 200 0 530 550
 2 1 1 2 3 600 0 200 0 530 550
 3 3 2 2 3 150 500 . 120 3 58
 3 2 2 2 3 150 500 . 120 3 58
 3 1 2 2 1 150 500 . 120 3 58
 4 3 1 2 3 200 300 550 270 0 50
 4 2 1 2 3 200 300 550 270 0 50
 4 1 1 2 3 200 300 550 270 0 50
 5 1 1 2 . 20 20 200 200 500 57
 5 2 1 2 . 20 20 200 200 500 57
 5 3 1 2 . 20 20 200 200 500 57
 run;quit;

HAVE

/*
Up to 40 obs WC000001.DF_A total obs=11

 ID    OBS    KEY    E1    E2    E3     B0     B1     B2     B3     B4     B5

  2     2       1     1     2     3    600      0    200      0    530    550
  2     1       2     1     2     3    600      0    200      0    530    550
  3     3       3     2     2     3    150    500      .    120      3     58
  3     2       4     2     2     3    150    500      .    120      3     58
  3     1       5     2     2     1    150    500      .    120      3     58
  4     3       6     1     2     3    200    300    550    270      0     50
  4     2       7     1     2     3    200    300    550    270      0     50
  4     1       8     1     2     3    200    300    550    270      0     50
  5     1       9     1     2     .     20     20    200    100    500     57
  5     2      10     1     2     .     20     20    200    100    500     57
  5     3      11     1     2     .     20     20    200    100    500     57

WANT

Up to 40 obs from xpt.keyres total obs=&tob

ID    OBS    KEY    E1    E2    E3     B0     B1     B2     B3     B4     B5    MODE_E    MODE_B

 2     2       1     1     2     3    600      0    200      0    530    550       1          0  single mode
 2     1       2     1     2     3    600      0    200      0    530    550       2          0  single mode
 3     3       3     2     2     3    150    500      .    120      3     58       3          3  random
 3     2       4     2     2     3    150    500      .    120      3     58       2        500  random
 3     1       5     2     2     1    150    500      .    120      3     58       2          3  random
 4     3       6     1     2     3    200    300    550    270      0     50       6        300  random
 4     2       7     1     2     3    200    300    550    270      0     50       7        270  random
 4     1       8     1     2     3    200    300    550    270      0     50       2        270  random
 5     1       9     1     2     .     20     20    200    200    500     57       2         20  multimode(random)
 5     2      10     1     2     .     20     20    200    200    500     57       2         200 multimode(random)
 5     3      11     1     2     .     20     20    200    200    500     57       2         20  multimode(random)

*/

SOLUTION
/*
Using list lapply find max count of duplicates
Get the values corresponding to max counts
Sample random one of multiple modes
Sample random one if no mode (all values distinct)
Sample has an issue when there is only one valee so I use resamp
*/

%utl_submit_r(
    library(haven);
    library(SASxport);
    datcut<-as.matrix(read_sas('d:/sd1/df_a.sas7bdat'));
    /* Begin R solution */
    resamp <- function(x){if (length(x)==1) {x} else {sample(x,1)}};
    rmode <- function(x) {
      x <- sort(x);
      u <- unique(x);
      if (length(x) == length(u)) {resamp(x)} else {
         y <- lapply(u, function(y) length(x[x==y]));
         resamp(u[which( unlist(y) == max(unlist(y)))]);
      }
    };
    mode_e<-apply(datcut[,c(4:6)],1,rmode);
    mode_b<-apply(datcut[,c(7:11)],1,rmode);
    keyres=as.data.frame(cbind(datcut,as.data.frame(cbind(mode_e,mode_b))));
    keyres;
    write.xport(keyres,file='d:/xpt/xfr&set..xpt');
));

/* R LOG
> library(haven); library(SASxport); datcut<-as.matrix(read_sas('d:/sd1/df_a.sas7bdat'));
resamp <- function(x){if (length(x)==1) {x} else {sample(x,1)}}; rmode <- functi
on(x) {   x <- sort(x);   u <- unique(x);   y <- lapply(u, function(y) length(x[x==y]));
u[which( unlist(y) == max(unlist(y)))]; }; mode_e<-unlist(lapply(apply(datcut[,c
(4:6)],1,rmode),resamp)); mode_b<-unlist(l

   ID OBS KEY E1 E2 E3  B0  B1  B2  B3  B4  B5 mode_e mode_b
1   2   2   1  1  2  3 600   0 200   0 530 550      2      0
2   2   1   2  1  2  3 600   0 200   0 530 550      1      0

3   3   3   3  2  2  3 150 500  NA 120   3  58      2    150 Random
4   3   2   4  2  2  3 150 500  NA 120   3  58      2    120 Random
5   3   1   5  2  2  1 150 500  NA 120   3  58      2    500 Random

6   4   3   6  1  2  3 200 300 550 270   0  50      2      0 Random
7   4   2   7  1  2  3 200 300 550 270   0  50      1    270 Random
8   4   1   8  1  2  3 200 300 550 270   0  50      1    270 Random

9   5   1   9  1  2 NA  20  20 200 200 500  57      1     200 multimode
10  5   2  10  1  2 NA  20  20 200 200 500  57      2     20  multimode
11  5   3  11  1  2 NA  20  20 200 200 500  57      2     200 multimode
*/


libname xpt xport 'd:/xpt/xfr.xpt';
proc print data=xpt.keyres;
;run;quit;
libname xpt clear;

/*
SAS Proc print
ID    OBS    KEY    E1    E2    E3     B0     B1     B2     B3     B4     B5    MODE_E    MODE_B

  2     2       1     1     2     3    600      0    200      0    530    550       2          0
  2     1       2     1     2     3    600      0    200      0    530    550       1          0
  3     3       3     2     2     3    150    500      .    120      3     58       2        150
  3     2       4     2     2     3    150    500      .    120      3     58       2        120
  3     1       5     2     2     1    150    500      .    120      3     58       2        500
  4     3       6     1     2     3    200    300    550    270      0     50       2          0
  4     2       7     1     2     3    200    300    550    270      0     50       1        270
  4     1       8     1     2     3    200    300    550    270      0     50       1        270
  5     1       9     1     2     .     20     20    200    100    500     57       1        200
  5     2      10     1     2     .     20     20    200    100    500     57       2         20
  5     3      11     1     2     .     20     20    200    100    500     57       2        200
*/


SCALE THE PROBLEM

* This solution only scales well for small data, in the millions;
* Create a very small dataset of 400,000 obs;

* We will run 8 parallel jobs so lets
  split the 400,000  observations
  into 8 pieces of 50,000 obs each;

* Create a small datasets about 400,000 obs;

* 400,000;
data "d:/wrk/datcut.sas7bdat";
   retain key 0;
   set "d:/sd1/df_a.sas7bdat"(drop=key obs=10);
   do i=1 to 40000;
      key=key+1;
      output;
   end;
   drop i;
;run;quit;

/*
 Up to 40 obs "d:/wrk/datcut.sas7bdat" total obs=400,000

 KEY    ID    OBS    E1    E2    E3     B0    B1     B2    B3     B4     B5

   1     2     2      1     2     3    600     0    200     0    530    550
   2     2     2      1     2     3    600     0    200     0    530    550
   3     2     2      1     2     3    600     0    200     0    530    550
   4     2     2      1     2     3    600     0    200     0    530    550
   5     2     2      1     2     3    600     0    200     0    530    550
   6     2     2      1     2     3    600     0    200     0    530    550
   7     2     2      1     2     3    600     0    200     0    530    550
   8     2     2      1     2     3    600     0    200     0    530    550
*/

%macro getmode(set,fyl=d:/wrk/datcut.sas7bdat,span=40000);

    /*
      %let set=5;
      %let fyl=d:/wrk/datcut.sas7bdat;
      %let span=5000;
    */

    %local b e;

    %let e=%sysevalf(&set * &span );
    %let b=%sysevalf((&set - 1)*&span +1);

    %put Stary  &=b;
    %put END    &=e;

    * grab 50,000 subset;
    data "d:/wrk/subset&set..sas7bdat";
       set "&fyl"(firstobs=&b obs=&e);
    run;quit;

    %utl_submit_r(
    library(haven);
    library(SASxport);
    datcut<-as.matrix(read_sas('d:/wrk/subset&set..sas7bdat'));
    /* Begin R solution */
    resamp <- function(x){if (length(x)==1) {x} else {sample(x,1)}};
    rmode <- function(x) {
      x <- sort(x);
      u <- unique(x);
      if (length(x) == length(u)) {resamp(x)} else {
         y <- lapply(u, function(y) length(x[x==y]));
         resamp(u[which( unlist(y) == max(unlist(y)))]);
      }
    };
    mode_e<-apply(datcut[,c(4:6)],1,rmode);
    mode_b<-apply(datcut[,c(7:12)],1,rmode);
    keyres=as.data.frame(cbind(datcut,as.data.frame(cbind(mode_e,mode_b))));
    write.xport(keyres,file='d:/xpt/xfr&set..xpt');
    );


%mend getmode;

* if you tested with let statements inside macro (creates globals);
%symdel set b e fyl;

%getmode(3);

libname xpt xport 'd:/xpt/xfr3.xpt';
proc print data=xpt.keyres(obs=10);
;run;quit;
libname xpt clear;



%let _s=%sysfunc(compbl(C:\Progra~1\SASHome\SASFoundation\9.4\sas.exe -sysin c:\nul -sasautos c:\oto -autoexec c:\oto\Tut_Oto.sas
-work h:\wrk));


options noxwait noxsync;
%let tym=%sysfunc(time());
systask kill sys1 sys2 sys3 sys4  sys5 sys6 sys7 sys8;
systask command "&_s -termstmt %nrstr(%getmode(1);) -log G:\wrk\sys1.log" taskname=sys1;
systask command "&_s -termstmt %nrstr(%getmode(2);) -log G:\wrk\sys2.log" taskname=sys2;
systask command "&_s -termstmt %nrstr(%getmode(3);) -log G:\wrk\sys3.log" taskname=sys3;
systask command "&_s -termstmt %nrstr(%getmode(4);) -log G:\wrk\sys4.log" taskname=sys4;
systask command "&_s -termstmt %nrstr(%getmode(5);) -log G:\wrk\sys5.log" taskname=sys5;
systask command "&_s -termstmt %nrstr(%getmode(6);) -log G:\wrk\sys6.log" taskname=sys6;
systask command "&_s -termstmt %nrstr(%getmode(7);) -log G:\wrk\sys7.log" taskname=sys7;
systask command "&_s -termstmt %nrstr(%getmode(8);) -log G:\wrk\sys8.log" taskname=sys8;
waitfor sys1 sys2 sys3 sys4  sys5 sys6 sys7 sys8;
%put %sysevalf( %sysfunc(time()) - &tym);


/*
NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
NOTE: The SAS System used:
      real time           4:15.10


                                           Cumulative    Cumulative
E1    E2    E3    Frequency     Percent     Frequency      Percent
-------------------------------------------------------------------
 1     2     .       80000       20.00         80000        20.00
 1     2     3      200000       50.00        280000        70.00
 2     2     1       40000       10.00        320000        80.00
 2     2     3       80000       20.00        400000       100.00

*/


%macro vueapp(dummy);

  %local i;
  %do i=1 %to 8;
    libname xpt&i xport "d:/xpt/xfr&i..xpt";
  %end;

  libname sd1 "d:/sd1";

  proc datasets lib=sd1 mt=view;
  delete keyresall;
  ;run;quit;

  data sd1.keyresall/view=sd1.keyresall;
    set
       xpt1.keyres
       xpt2.keyres
       xpt3.keyres
       xpt4.keyres
       xpt5.keyres
       xpt6.keyres
       xpt7.keyres
       xpt8.keyres
    ;
  ;run;quit;

  proc freq data=sd1.keyresall;
    tables e1*e2*e3 / list missing;
  ;run;quit;

%mend vueapp;

%vueapp;


libname xpt xport 'd:/xpt/xfr8.xpt';
proc print data=sd1.keyresall(where=(uniform(-1)<10/400000));
;run;quit;
libname xpt clear;

 /*
    KEY     ID    OBS    E1    E2    E3     B0     B1     B2     B3     B4     B5    MODE_E    MODE_B

   12384     2     2      1     2     3    600      0    200      0    530    550       2          0
   50429     2     1      1     2     3    600      0    200      0    530    550       2          0
   54631     2     1      1     2     3    600      0    200      0    530    550       1          0
   77263     2     1      1     2     3    600      0    200      0    530    550       2          0
   77572     2     1      1     2     3    600      0    200      0    530    550       1          0
  130353     3     2      2     2     3    150    500      .    120      3     58       2          3
  233744     4     3      1     2     3    200    300    550    270      0     50       2        550
  319192     4     1      1     2     3    200    300    550    270      0     50       1        200
*/


Valued Guide
Posts: 505

Re: Mode computation for different variables in SAS

Computing the mode by row (combination of R and R)

HAVE

Need the mode for variables e1-e3 and  b0-b5 in a new column (mode_e, mode_b).
If multimode, select one randomly.
If unique select select one randomly.
Keep all the variables in the final output.
The code must be able to compute this for large database.

The R code below can process about 100,000 records a minute
on my 2008 T7400(64gb ram) running 8 jobs in parallel.
10 minutes to do 1 million, 1000 minutes to 100 millon (still small scale).
Too long to get to big data (ie 10 billion +).
10 billion would take 100,000 minutes. (100000/1440) = 69 days.
SAS should be faster, but not as elgant(maybe IML)?.

The solutiion below will not scale to big data (ie 10 billion+ rows)
Maybe it will work work with millions of rows

* added a key - I like I primary key;
data "d:/sd1/df_a.sas7bdat" ;
 retain id obs key 0;
 input id obs e1 e2 e3 b0 b1 b2 b3 b4 b5;
 key=_n_;
 datalines;
 2 2 1 2 3 600 0 200 0 530 550
 2 1 1 2 3 600 0 200 0 530 550
 3 3 2 2 3 150 500 . 120 3 58
 3 2 2 2 3 150 500 . 120 3 58
 3 1 2 2 1 150 500 . 120 3 58
 4 3 1 2 3 200 300 550 270 0 50
 4 2 1 2 3 200 300 550 270 0 50
 4 1 1 2 3 200 300 550 270 0 50
 5 1 1 2 . 20 20 200 200 500 57
 5 2 1 2 . 20 20 200 200 500 57
 5 3 1 2 . 20 20 200 200 500 57
 run;quit;

HAVE

/*
Up to 40 obs WC000001.DF_A total obs=11

 ID    OBS    KEY    E1    E2    E3     B0     B1     B2     B3     B4     B5

  2     2       1     1     2     3    600      0    200      0    530    550
  2     1       2     1     2     3    600      0    200      0    530    550
  3     3       3     2     2     3    150    500      .    120      3     58
  3     2       4     2     2     3    150    500      .    120      3     58
  3     1       5     2     2     1    150    500      .    120      3     58
  4     3       6     1     2     3    200    300    550    270      0     50
  4     2       7     1     2     3    200    300    550    270      0     50
  4     1       8     1     2     3    200    300    550    270      0     50
  5     1       9     1     2     .     20     20    200    100    500     57
  5     2      10     1     2     .     20     20    200    100    500     57
  5     3      11     1     2     .     20     20    200    100    500     57

WANT

Up to 40 obs from xpt.keyres total obs=&tob

ID    OBS    KEY    E1    E2    E3     B0     B1     B2     B3     B4     B5    MODE_E    MODE_B

 2     2       1     1     2     3    600      0    200      0    530    550       1          0  single mode
 2     1       2     1     2     3    600      0    200      0    530    550       2          0  single mode
 3     3       3     2     2     3    150    500      .    120      3     58       3          3  random
 3     2       4     2     2     3    150    500      .    120      3     58       2        500  random
 3     1       5     2     2     1    150    500      .    120      3     58       2          3  random
 4     3       6     1     2     3    200    300    550    270      0     50       6        300  random
 4     2       7     1     2     3    200    300    550    270      0     50       7        270  random
 4     1       8     1     2     3    200    300    550    270      0     50       2        270  random
 5     1       9     1     2     .     20     20    200    200    500     57       2         20  multimode(random)
 5     2      10     1     2     .     20     20    200    200    500     57       2         200 multimode(random)
 5     3      11     1     2     .     20     20    200    200    500     57       2         20  multimode(random)

*/

SOLUTION
/*
Using list lapply find max count of duplicates
Get the values corresponding to max counts
Sample random one of multiple modes
Sample random one if no mode (all values distinct)
Sample has an issue when there is only one valee so I use resamp
*/

%utl_submit_r(
    library(haven);
    library(SASxport);
    datcut<-as.matrix(read_sas('d:/sd1/df_a.sas7bdat'));
    /* Begin R solution */
    resamp <- function(x){if (length(x)==1) {x} else {sample(x,1)}};
    rmode <- function(x) {
      x <- sort(x);
      u <- unique(x);
      if (length(x) == length(u)) {resamp(x)} else {
         y <- lapply(u, function(y) length(x[x==y]));
         resamp(u[which( unlist(y) == max(unlist(y)))]);
      }
    };
    mode_e<-apply(datcut[,c(4:6)],1,rmode);
    mode_b<-apply(datcut[,c(7:11)],1,rmode);
    keyres=as.data.frame(cbind(datcut,as.data.frame(cbind(mode_e,mode_b))));
    keyres;
    write.xport(keyres,file='d:/xpt/xfr&set..xpt');
));

/* R LOG
> library(haven); library(SASxport); datcut<-as.matrix(read_sas('d:/sd1/df_a.sas7bdat'));
resamp <- function(x){if (length(x)==1) {x} else {sample(x,1)}}; rmode <- functi
on(x) {   x <- sort(x);   u <- unique(x);   y <- lapply(u, function(y) length(x[x==y]));
u[which( unlist(y) == max(unlist(y)))]; }; mode_e<-unlist(lapply(apply(datcut[,c
(4:6)],1,rmode),resamp)); mode_b<-unlist(l

   ID OBS KEY E1 E2 E3  B0  B1  B2  B3  B4  B5 mode_e mode_b
1   2   2   1  1  2  3 600   0 200   0 530 550      2      0
2   2   1   2  1  2  3 600   0 200   0 530 550      1      0

3   3   3   3  2  2  3 150 500  NA 120   3  58      2    150 Random
4   3   2   4  2  2  3 150 500  NA 120   3  58      2    120 Random
5   3   1   5  2  2  1 150 500  NA 120   3  58      2    500 Random

6   4   3   6  1  2  3 200 300 550 270   0  50      2      0 Random
7   4   2   7  1  2  3 200 300 550 270   0  50      1    270 Random
8   4   1   8  1  2  3 200 300 550 270   0  50      1    270 Random

9   5   1   9  1  2 NA  20  20 200 200 500  57      1     200 multimode
10  5   2  10  1  2 NA  20  20 200 200 500  57      2     20  multimode
11  5   3  11  1  2 NA  20  20 200 200 500  57      2     200 multimode
*/


libname xpt xport 'd:/xpt/xfr.xpt';
proc print data=xpt.keyres;
;run;quit;
libname xpt clear;

/*
SAS Proc print
ID    OBS    KEY    E1    E2    E3     B0     B1     B2     B3     B4     B5    MODE_E    MODE_B

  2     2       1     1     2     3    600      0    200      0    530    550       2          0
  2     1       2     1     2     3    600      0    200      0    530    550       1          0
  3     3       3     2     2     3    150    500      .    120      3     58       2        150
  3     2       4     2     2     3    150    500      .    120      3     58       2        120
  3     1       5     2     2     1    150    500      .    120      3     58       2        500
  4     3       6     1     2     3    200    300    550    270      0     50       2          0
  4     2       7     1     2     3    200    300    550    270      0     50       1        270
  4     1       8     1     2     3    200    300    550    270      0     50       1        270
  5     1       9     1     2     .     20     20    200    100    500     57       1        200
  5     2      10     1     2     .     20     20    200    100    500     57       2         20
  5     3      11     1     2     .     20     20    200    100    500     57       2        200
*/


SCALE THE PROBLEM

* This solution only scales well for small data, in the millions;
* Create a very small dataset of 400,000 obs;

* We will run 8 parallel jobs so lets
  split the 400,000  observations
  into 8 pieces of 50,000 obs each;

* Create a small datasets about 400,000 obs;

* 400,000;
data "d:/wrk/datcut.sas7bdat";
   retain key 0;
   set "d:/sd1/df_a.sas7bdat"(drop=key obs=10);
   do i=1 to 40000;
      key=key+1;
      output;
   end;
   drop i;
;run;quit;

/*
 Up to 40 obs "d:/wrk/datcut.sas7bdat" total obs=400,000

 KEY    ID    OBS    E1    E2    E3     B0    B1     B2    B3     B4     B5

   1     2     2      1     2     3    600     0    200     0    530    550
   2     2     2      1     2     3    600     0    200     0    530    550
   3     2     2      1     2     3    600     0    200     0    530    550
   4     2     2      1     2     3    600     0    200     0    530    550
   5     2     2      1     2     3    600     0    200     0    530    550
   6     2     2      1     2     3    600     0    200     0    530    550
   7     2     2      1     2     3    600     0    200     0    530    550
   8     2     2      1     2     3    600     0    200     0    530    550
*/

%macro getmode(set,fyl=d:/wrk/datcut.sas7bdat,span=40000);

    /*
      %let set=5;
      %let fyl=d:/wrk/datcut.sas7bdat;
      %let span=5000;
    */

    %local b e;

    %let e=%sysevalf(&set * &span );
    %let b=%sysevalf((&set - 1)*&span +1);

    %put Stary  &=b;
    %put END    &=e;

    * grab 50,000 subset;
    data "d:/wrk/subset&set..sas7bdat";
       set "&fyl"(firstobs=&b obs=&e);
    run;quit;

    %utl_submit_r(
    library(haven);
    library(SASxport);
    datcut<-as.matrix(read_sas('d:/wrk/subset&set..sas7bdat'));
    /* Begin R solution */
    resamp <- function(x){if (length(x)==1) {x} else {sample(x,1)}};
    rmode <- function(x) {
      x <- sort(x);
      u <- unique(x);
      if (length(x) == length(u)) {resamp(x)} else {
         y <- lapply(u, function(y) length(x[x==y]));
         resamp(u[which( unlist(y) == max(unlist(y)))]);
      }
    };
    mode_e<-apply(datcut[,c(4:6)],1,rmode);
    mode_b<-apply(datcut[,c(7:12)],1,rmode);
    keyres=as.data.frame(cbind(datcut,as.data.frame(cbind(mode_e,mode_b))));
    write.xport(keyres,file='d:/xpt/xfr&set..xpt');
    );


%mend getmode;

* if you tested with let statements inside macro (creates globals);
%symdel set b e fyl;

%getmode(3);

libname xpt xport 'd:/xpt/xfr3.xpt';
proc print data=xpt.keyres(obs=10);
;run;quit;
libname xpt clear;



%let _s=%sysfunc(compbl(C:\Progra~1\SASHome\SASFoundation\9.4\sas.exe -sysin c:\nul -sasautos c:\oto -autoexec c:\oto\Tut_Oto.sas
-work h:\wrk));


options noxwait noxsync;
%let tym=%sysfunc(time());
systask kill sys1 sys2 sys3 sys4  sys5 sys6 sys7 sys8;
systask command "&_s -termstmt %nrstr(%getmode(1);) -log G:\wrk\sys1.log" taskname=sys1;
systask command "&_s -termstmt %nrstr(%getmode(2);) -log G:\wrk\sys2.log" taskname=sys2;
systask command "&_s -termstmt %nrstr(%getmode(3);) -log G:\wrk\sys3.log" taskname=sys3;
systask command "&_s -termstmt %nrstr(%getmode(4);) -log G:\wrk\sys4.log" taskname=sys4;
systask command "&_s -termstmt %nrstr(%getmode(5);) -log G:\wrk\sys5.log" taskname=sys5;
systask command "&_s -termstmt %nrstr(%getmode(6);) -log G:\wrk\sys6.log" taskname=sys6;
systask command "&_s -termstmt %nrstr(%getmode(7);) -log G:\wrk\sys7.log" taskname=sys7;
systask command "&_s -termstmt %nrstr(%getmode(8);) -log G:\wrk\sys8.log" taskname=sys8;
waitfor sys1 sys2 sys3 sys4  sys5 sys6 sys7 sys8;
%put %sysevalf( %sysfunc(time()) - &tym);


/*
NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
NOTE: The SAS System used:
      real time           4:15.10


                                           Cumulative    Cumulative
E1    E2    E3    Frequency     Percent     Frequency      Percent
-------------------------------------------------------------------
 1     2     .       80000       20.00         80000        20.00
 1     2     3      200000       50.00        280000        70.00
 2     2     1       40000       10.00        320000        80.00
 2     2     3       80000       20.00        400000       100.00

*/


%macro vueapp(dummy);

  %local i;
  %do i=1 %to 8;
    libname xpt&i xport "d:/xpt/xfr&i..xpt";
  %end;

  libname sd1 "d:/sd1";

  proc datasets lib=sd1 mt=view;
  delete keyresall;
  ;run;quit;

  data sd1.keyresall/view=sd1.keyresall;
    set
       xpt1.keyres
       xpt2.keyres
       xpt3.keyres
       xpt4.keyres
       xpt5.keyres
       xpt6.keyres
       xpt7.keyres
       xpt8.keyres
    ;
  ;run;quit;

  proc freq data=sd1.keyresall;
    tables e1*e2*e3 / list missing;
  ;run;quit;

%mend vueapp;

%vueapp;


libname xpt xport 'd:/xpt/xfr8.xpt';
proc print data=sd1.keyresall(where=(uniform(-1)<10/400000));
;run;quit;
libname xpt clear;

 /*
    KEY     ID    OBS    E1    E2    E3     B0     B1     B2     B3     B4     B5    MODE_E    MODE_B

   12384     2     2      1     2     3    600      0    200      0    530    550       2          0
   50429     2     1      1     2     3    600      0    200      0    530    550       2          0
   54631     2     1      1     2     3    600      0    200      0    530    550       1          0
   77263     2     1      1     2     3    600      0    200      0    530    550       2          0
   77572     2     1      1     2     3    600      0    200      0    530    550       1          0
  130353     3     2      2     2     3    150    500      .    120      3     58       2          3
  233744     4     3      1     2     3    200    300    550    270      0     50       2        550
  319192     4     1      1     2     3    200    300    550    270      0     50       1        200
*/


Super User
Posts: 9,662

Re: Mode computation for different variables in SAS

You have two category of variables, so you need run the code two times.

 

 

data have;
input id 	obs 	e1 	e2 	e3 	b0 	b1 	b2 	b3 	b4 	b5;
datalines;
    2 	2 	1 	2 	3 	600 	0 	200 	0 	530 	550
    2 	1 	1 	2 	3 	600 	0 	200 	0 	530 	550
    3 	3 	2 	2 	3 	150 	500 	. 	120 	3 	58
    3 	2 	2 	2 	3 	150 	500 	. 	120 	3 	58
    3 	1 	2 	2 	1 	150 	500 	. 	120 	3 	58
    4 	3 	1 	2 	3 	200 	300 	550 	270 	0 	50
    4 	2 	1 	2 	3 	200 	300 	550 	270 	0 	50
    4 	1 	1 	2 	3 	200 	300 	550 	270 	0 	50
    5 	1 	1 	2 	. 	20 	20 	200 	100 	500 	57
    5 	2 	1 	2 	. 	20 	20 	200 	100 	500 	57
    5 	3 	1 	2 	. 	20 	20 	200 	100 	500 	57
;
run;


data want(drop=k count i max ran);
if _n_ eq 1 then do;
declare hash ha();
declare hiter hi('ha');
  ha.definekey('k');
  ha.definedata('k','count');
  ha.definedone();
declare hash random(ordered:'a');
declare hiter hir('random');
  random.definekey('ran');
  random.definedata('k','count');
  random.definedone();
end;
set have;




array x{*} e: ;
do i=1 to dim(x);
if not missing(x{i}) then do;
     k=x{i};
     if ha.find()=0 then do;count+1;ha.replace();end;
       else do;count=1;ha.add();end;
end;
end;
do while(hi.next()=0);
ran=ranuni(-1);random.add();
end;
max=0;
do while(hir.next()=0);
if count gt max then do;max=count;e_mod=k;end;
end;
ha.clear();     random.clear();






array y{*} b: ;
do i=1 to dim(y);
if not missing(y{i}) then do;
     k=y{i};
     if ha.find()=0 then do;count+1;ha.replace();end;
       else do;count=1;ha.add();end;
end;
end;
do while(hi.next()=0);
ran=ranuni(-1);random.add();
end;
max=0;
do while(hir.next()=0);
if count gt max then do;max=count;b_mod=k;end;
end;
ha.clear();     random.clear();

run;
Valued Guide
Posts: 505

Re: Mode computation for different variables in SAS

Hi Team

 

  There was atype in my previous heading, should have been R and SAS not R and R.

 

  I ran Xia's code and it took about 6 seconds for 400,000 records. Much faster than R.

  Mine took 4 minutes, running in parallel.

 

  The output was the same as mine.

 

The FREQ Procedure

                                           Cumulative    Cumulative
E1    E2    E3    Frequency     Percent     Frequency      Percent
-------------------------------------------------------------------
 1     2     .       80000       20.00         80000        20.00
 1     2     3      200000       50.00        280000        70.00
 2     2     1       40000       10.00        320000        80.00
 2     2     3       80000       20.00        400000       100.00
Valued Guide
Posts: 505

Re: Mode computation for different variables in SAS

Not sure why I am getting double posts

I will try to explain Xias astonishing fast mode algoritm Lets explain how we arrive at B_MOD O B0 B1 B2 B3 B4 B5 B_MOD 1 20 20 200 200 300 300 20 2 20 20 200 200 300 300 200 3 20 20 200 200 300 300 300 OBSERVATION 1 B_MOD=20 * first we examine each of the Bs against all other Bs, but add only when we exceed a previous count. b iter COUNT=1 K=20 20 20 200 200 300 300 b iter COUNT=2 K=20 20 20 200 200 300 300 b iter COUNT=1 K=200 20 20 200 200 300 300 b iter COUNT=2 K=200 20 20 200 200 300 300 b iter COUNT=1 K=300 20 20 200 200 300 300 b iter COUNT=2 K=300 20 20 200 200 300 300 * final analysis we yhe max count of 2 for three pairs; final b COUNT=2 K=300 20 20 200 200 300 300 * nest we generate a random sequence of uniform random numbers; ran iter b RAN=0.9850622509 K=200 COUNT=2 ran iter b RAN=0.8866575662 K=300 COUNT=2 ran iter b RAN=0.2885982167 K=20 COUNT=2 iter max count K=20 MAX=2 B_MOD=20 iter max count K=300 MAX=2 B_MOD=20 iter max count K=200 MAX=2 B_MOD=20 find max count K=200 MAX=2 B_MOD=20 * they all have max=2 so we take the one with the lowest random number Remember ran is ordered RAN=0.2885982167 K=20 ie B_MOD=20 OBSERVATION 2 B_MOD=200 Same as above b iter COUNT=1 K=20 20 20 200 200 300 300 b iter COUNT=2 K=20 20 20 200 200 300 300 b iter COUNT=1 K=200 20 20 200 200 300 300 b iter COUNT=2 K=200 20 20 200 200 300 300 b iter COUNT=1 K=300 20 20 200 200 300 300 b iter COUNT=2 K=300 20 20 200 200 300 300 final b COUNT=2 K=300 20 20 200 200 300 300 ran iter b RAN=0.0071778353 K=200 COUNT=2 ran iter b RAN=0.5806072809 K=300 COUNT=2 ran iter b RAN=0.9682415142 K=20 COUNT=2 iter max count K=200 MAX=2 B_MOD=200 iter max count K=300 MAX=2 B_MOD=200 iter max count K=20 MAX=2 B_MOD=200 find max count K=20 MAX=2 B_MOD=200 * they all have max=2 so we take the one with the lowest random number Remember ran is ordered RAN=0.0071778353 K=200 ie B_MOD=200 b iter COUNT=1 K=20 20 20 200 200 300 300 b iter COUNT=2 K=20 20 20 200 200 300 300 b iter COUNT=1 K=200 20 20 200 200 300 300 b iter COUNT=2 K=200 20 20 200 200 300 300 b iter COUNT=1 K=300 20 20 200 200 300 300 b iter COUNT=2 K=300 20 20 200 200 300 300 final b COUNT=2 K=300 20 20 200 200 300 300 ran iter b RAN=0.8807552 K=200 COUNT=2 ran iter b RAN=0.2498995183 K=300 COUNT=2 ran iter b RAN=0.7680367258 K=20 COUNT=2 iter max count K=300 MAX=2 B_MOD=300 iter max count K=20 MAX=2 B_MOD=300 iter max count K=200 MAX=2 B_MOD=300 find max count K=200 MAX=2 B_MOD=300 * they all have max=2 so we take the one with the lowest random number Remember ran is ordered RAN=0.2498995183 K=300 ie B_MOD=300 %macro repmode(dummy); %let cmt=%str(*); %do i=1 %to 3; data want(drop=k count i max ran); /* setup */ if _n_ eq 1 then do; declare hash ha (); declare hiter hi('ha'); ha.definekey ('k'); ha.definedata ('k','count'); ha.definedone (); declare hash random(ordered:'a'); declare hiter hir ('random'); random.definekey ('ran'); random.definedata ('k','count'); random.definedone (); end; set have(obs=1); array x{*} e: ; do i=1 to dim(x); if not missing(x{i}) then do; k=x{i}; if ha.find()=0 then do; count+1; ha.replace(); end; else do; count=1; ha.add(); end; end; &cmt put 'e iter ' count= e1= e2= e3=; end; &cmt put 'final e ' count= e1= e2= e3=; /* As it iterates through the e array it checks for for duplicates and updates the counter count Input E1 E2 E3 2 2 2 e iter COUNT=1 E1=2 E2=2 E3=2 e iter COUNT=2 E1=2 E2=2 E3=2 e iter COUNT=3 E1=2 E2=2 E3=2 final e COUNT=3 E1=2 E2=2 E3=2 */ do while(hi.next()=0); ran=ranuni(-1); random.add(); &cmt put 'ran ' ran=; end; /* ran RAN=0.5100509643 */ max=0; do while(hir.next()=0); if count gt max then do; max=count; e_mod=k; end; end; &cmt put 'find max count ' max= e_mod=; /* we have the value and the max; find max count MAX=3 E_MOD=2 */ ha.clear(); random.clear(); * mobe onto the b array; array y{*} b: ; do i=1 to dim(y); if not missing(y{i}) then do; k=y{i}; if ha.find()=0 then do; count+1; ha.replace(); end; else do; count=1; ha.add(); end; end; put 'b iter ' count= k= b0 b1 b2 b3 b4 b5; end; put 'final b ' count= k= b0 b1 b2 b3 b4 b5; /* INPUT B0 B1 B2 B3 B4 B5 20 20 200 200 500 57 b iter COUNT=1 20 20 200 200 500 57 b iter COUNT=2 20 20 200 200 500 57 The two 20s b iter COUNT=1 20 20 200 200 500 57 b iter COUNT=2 20 20 200 200 500 57 The two 200s b iter COUNT=1 20 20 200 200 500 57 b iter COUNT=1 20 20 200 200 500 57 final b COUNT=1 20 20 200 200 500 57 */ do while(hi.next()=0); ran=ranuni(-1); random.add(); put 'ran iter b ' ran= k= count=; end; put 'ran fin b ' ran= k= count=; /* ran iter b RAN=0.5936162847 ran iter b RAN=0.5504719776 ran iter b RAN=0.1298635263 ran iter b RAN=0.2945035129 ran fin b RAN=0.2945035129 */ max=0; do while(hir.next()=0); if count gt max then do; max=count; b_mod=k; end; put 'iter max count ' k= max= b_mod= ran=; end; put 'find max count ' k= max= b_mod= rsn=; ha.clear(); random.clear(); run; proc append data=want base=wants; ;run;quit; %end; %mend repmode; proc datasets lib=work; delete wants; ;run;quit; %utlnopts; %repmode;
Super User
Posts: 9,662

Re: Mode computation for different variables in SAS

Roger,

I want you test the following IML code if it would be faster than data step.

 

 

data have;
input id 	obs 	e1 	e2 	e3 	b0 	b1 	b2 	b3 	b4 	b5;
datalines;
    2 	2 	1 	2 	3 	600 	0 	200 	0 	530 	550
    2 	1 	1 	2 	3 	600 	0 	200 	0 	530 	550
    3 	3 	2 	2 	3 	150 	500 	. 	120 	3 	58
    3 	2 	2 	2 	3 	150 	500 	. 	120 	3 	58
    3 	1 	2 	2 	1 	150 	500 	. 	120 	3 	58
    4 	3 	1 	2 	3 	200 	300 	550 	270 	0 	50
    4 	2 	1 	2 	3 	200 	300 	550 	270 	0 	50
    4 	1 	1 	2 	3 	200 	300 	550 	270 	0 	50
    5 	1 	1 	2 	. 	20 	20 	200 	100 	500 	57
    5 	2 	1 	2 	. 	20 	20 	200 	100 	500 	57
    5 	3 	1 	2 	. 	20 	20 	200 	100 	500 	57
;
run;
proc iml;
use have nobs nobs;
read all var {e1 e2 e3} into x;
read all var {b0 b1 b2 b3 b4 b5} into y;
close;

mode_e=j(nobs,1,.);
do i=1 to nobs;
 temp=x[i,];
 call tabulate(levels,freq,temp);
 max_idx=loc(freq=freq[<>]);
 want_idx=sample(max_idx,1);
 mode_e[i]=levels[want_idx];
end;

mode_b=j(nobs,1,.);
do i=1 to nobs;
 temp=y[i,];
 call tabulate(levels,freq,temp);
 max_idx=loc(freq=freq[<>]);
 want_idx=sample(max_idx,1);
 mode_b[i]=levels[want_idx];
end;

create mode var {mode_e mode_b};
append;
close;
quit;

data want;
 merge have mode;
run;
proc print noobs;run;
Ask a Question
Discussion stats
  • 9 replies
  • 360 views
  • 0 likes
  • 4 in conversation