I need to compute the mode for different variables for the same id using SAS. The original dataset is
data df_a;
input id obs e1 e2 e3 b0 b1 b2 b3 b4 b5;
datalines;
2 2 1 2 3 600 0 200 0 530 550
2 1 1 2 3 600 0 200 0 530 550
3 3 2 2 3 150 500 . 120 3 58
3 2 2 2 3 150 500 . 120 3 58
3 1 2 2 1 150 500 . 120 3 58
4 3 1 2 3 200 300 550 270 0 50
4 2 1 2 3 200 300 550 270 0 50
4 1 1 2 3 200 300 550 270 0 50
5 1 1 2 . 20 20 200 100 500 57
5 2 1 2 . 20 20 200 100 500 57
5 3 1 2 . 20 20 200 100 500 57
run;
The desired output is
id obs e1 e2 e3 mode_e b0 b1 b2 b3 b4 b5 mode_b
2 2 1 2 3 random choice 600 0 200 0 530 550 0
2 1 1 2 3 random choice 600 0 200 0 530 550 0
3 3 2 2 3 2 150 500 . 120 3 58 random choice
3 2 2 2 3 2 150 500 . 120 3 58 random choice
3 1 2 2 1 2 150 500 . 120 3 58 random choice
4 3 1 2 3 random choice 200 300 550 270 0 50 random choice
4 2 1 2 3 random choice 200 300 550 270 0 50 random choice
4 1 1 2 3 random choice 200 300 550 270 0 50 random choice
5 1 1 2 . random choice 20 20 200 100 500 57 20
5 2 1 2 . random choice 20 20 200 100 500 57 20
5 3 1 2 . random choice 20 20 200 100 500 57 20
Points to consider
i) I need the mode for variables e1-e3 and b0-b5 in a new column (mode_e, mode_b).
iii) If multimode, select one randomly.
iv) keep all the variables in the final output.
v) The code must be able to compute this for large database
I have tried to use the code proposed here https://communities.sas.com/t5/Base-SAS-Programming/Calculating-Mode-across-variables/td-p/150375 but it does not give the desired output.
Can we assume b/e will stay the same for each record of each ID?
How did that solution not work?
Hi Reeza,
Yes, the mode for the variables must be the same for each id. For example for id=2 and the variables e1 to e3 the mode must be chosen randomly for this id but it must be the same for both observations.
id obs e1 e2 e3 mode_e b0 b1 b2 b3 b4 b5 mode_b
2 2 1 2 3 2 600 0 200 0 530 550 0
2 1 1 2 3 2 600 0 200 0 530 550 0
The linked solution returns the minimum value rather than a random choice.
The last solution returns minimum. The green/correct one returns a random one.
Computing the mode by row (combination of R and R)
HAVE
Need the mode for variables e1-e3 and b0-b5 in a new column (mode_e, mode_b).
If multimode, select one randomly.
If unique select select one randomly.
Keep all the variables in the final output.
The code must be able to compute this for large database.
The R code below can process about 100,000 records a minute
on my 2008 T7400(64gb ram) running 8 jobs in parallel.
10 minutes to do 1 million, 1000 minutes to 100 millon (still small scale).
Too long to get to big data (ie 10 billion +).
10 billion would take 100,000 minutes. (100000/1440) = 69 days.
SAS should be faster, but not as elgant(maybe IML)?.
The solutiion below will not scale to big data (ie 10 billion+ rows)
Maybe it will work work with millions of rows
* added a key - I like I primary key;
data "d:/sd1/df_a.sas7bdat" ;
retain id obs key 0;
input id obs e1 e2 e3 b0 b1 b2 b3 b4 b5;
key=_n_;
datalines;
2 2 1 2 3 600 0 200 0 530 550
2 1 1 2 3 600 0 200 0 530 550
3 3 2 2 3 150 500 . 120 3 58
3 2 2 2 3 150 500 . 120 3 58
3 1 2 2 1 150 500 . 120 3 58
4 3 1 2 3 200 300 550 270 0 50
4 2 1 2 3 200 300 550 270 0 50
4 1 1 2 3 200 300 550 270 0 50
5 1 1 2 . 20 20 200 200 500 57
5 2 1 2 . 20 20 200 200 500 57
5 3 1 2 . 20 20 200 200 500 57
run;quit;
HAVE
/*
Up to 40 obs WC000001.DF_A total obs=11
ID OBS KEY E1 E2 E3 B0 B1 B2 B3 B4 B5
2 2 1 1 2 3 600 0 200 0 530 550
2 1 2 1 2 3 600 0 200 0 530 550
3 3 3 2 2 3 150 500 . 120 3 58
3 2 4 2 2 3 150 500 . 120 3 58
3 1 5 2 2 1 150 500 . 120 3 58
4 3 6 1 2 3 200 300 550 270 0 50
4 2 7 1 2 3 200 300 550 270 0 50
4 1 8 1 2 3 200 300 550 270 0 50
5 1 9 1 2 . 20 20 200 100 500 57
5 2 10 1 2 . 20 20 200 100 500 57
5 3 11 1 2 . 20 20 200 100 500 57
WANT
Up to 40 obs from xpt.keyres total obs=&tob
ID OBS KEY E1 E2 E3 B0 B1 B2 B3 B4 B5 MODE_E MODE_B
2 2 1 1 2 3 600 0 200 0 530 550 1 0 single mode
2 1 2 1 2 3 600 0 200 0 530 550 2 0 single mode
3 3 3 2 2 3 150 500 . 120 3 58 3 3 random
3 2 4 2 2 3 150 500 . 120 3 58 2 500 random
3 1 5 2 2 1 150 500 . 120 3 58 2 3 random
4 3 6 1 2 3 200 300 550 270 0 50 6 300 random
4 2 7 1 2 3 200 300 550 270 0 50 7 270 random
4 1 8 1 2 3 200 300 550 270 0 50 2 270 random
5 1 9 1 2 . 20 20 200 200 500 57 2 20 multimode(random)
5 2 10 1 2 . 20 20 200 200 500 57 2 200 multimode(random)
5 3 11 1 2 . 20 20 200 200 500 57 2 20 multimode(random)
*/
SOLUTION
/*
Using list lapply find max count of duplicates
Get the values corresponding to max counts
Sample random one of multiple modes
Sample random one if no mode (all values distinct)
Sample has an issue when there is only one valee so I use resamp
*/
%utl_submit_r(
library(haven);
library(SASxport);
datcut<-as.matrix(read_sas('d:/sd1/df_a.sas7bdat'));
/* Begin R solution */
resamp <- function(x){if (length(x)==1) {x} else {sample(x,1)}};
rmode <- function(x) {
x <- sort(x);
u <- unique(x);
if (length(x) == length(u)) {resamp(x)} else {
y <- lapply(u, function(y) length(x[x==y]));
resamp(u[which( unlist(y) == max(unlist(y)))]);
}
};
mode_e<-apply(datcut[,c(4:6)],1,rmode);
mode_b<-apply(datcut[,c(7:11)],1,rmode);
keyres=as.data.frame(cbind(datcut,as.data.frame(cbind(mode_e,mode_b))));
keyres;
write.xport(keyres,file='d:/xpt/xfr&set..xpt');
));
/* R LOG
> library(haven); library(SASxport); datcut<-as.matrix(read_sas('d:/sd1/df_a.sas7bdat'));
resamp <- function(x){if (length(x)==1) {x} else {sample(x,1)}}; rmode <- functi
on(x) { x <- sort(x); u <- unique(x); y <- lapply(u, function(y) length(x[x==y]));
u[which( unlist(y) == max(unlist(y)))]; }; mode_e<-unlist(lapply(apply(datcut[,c
(4:6)],1,rmode),resamp)); mode_b<-unlist(l
ID OBS KEY E1 E2 E3 B0 B1 B2 B3 B4 B5 mode_e mode_b
1 2 2 1 1 2 3 600 0 200 0 530 550 2 0
2 2 1 2 1 2 3 600 0 200 0 530 550 1 0
3 3 3 3 2 2 3 150 500 NA 120 3 58 2 150 Random
4 3 2 4 2 2 3 150 500 NA 120 3 58 2 120 Random
5 3 1 5 2 2 1 150 500 NA 120 3 58 2 500 Random
6 4 3 6 1 2 3 200 300 550 270 0 50 2 0 Random
7 4 2 7 1 2 3 200 300 550 270 0 50 1 270 Random
8 4 1 8 1 2 3 200 300 550 270 0 50 1 270 Random
9 5 1 9 1 2 NA 20 20 200 200 500 57 1 200 multimode
10 5 2 10 1 2 NA 20 20 200 200 500 57 2 20 multimode
11 5 3 11 1 2 NA 20 20 200 200 500 57 2 200 multimode
*/
libname xpt xport 'd:/xpt/xfr.xpt';
proc print data=xpt.keyres;
;run;quit;
libname xpt clear;
/*
SAS Proc print
ID OBS KEY E1 E2 E3 B0 B1 B2 B3 B4 B5 MODE_E MODE_B
2 2 1 1 2 3 600 0 200 0 530 550 2 0
2 1 2 1 2 3 600 0 200 0 530 550 1 0
3 3 3 2 2 3 150 500 . 120 3 58 2 150
3 2 4 2 2 3 150 500 . 120 3 58 2 120
3 1 5 2 2 1 150 500 . 120 3 58 2 500
4 3 6 1 2 3 200 300 550 270 0 50 2 0
4 2 7 1 2 3 200 300 550 270 0 50 1 270
4 1 8 1 2 3 200 300 550 270 0 50 1 270
5 1 9 1 2 . 20 20 200 100 500 57 1 200
5 2 10 1 2 . 20 20 200 100 500 57 2 20
5 3 11 1 2 . 20 20 200 100 500 57 2 200
*/
SCALE THE PROBLEM
* This solution only scales well for small data, in the millions;
* Create a very small dataset of 400,000 obs;
* We will run 8 parallel jobs so lets
split the 400,000 observations
into 8 pieces of 50,000 obs each;
* Create a small datasets about 400,000 obs;
* 400,000;
data "d:/wrk/datcut.sas7bdat";
retain key 0;
set "d:/sd1/df_a.sas7bdat"(drop=key obs=10);
do i=1 to 40000;
key=key+1;
output;
end;
drop i;
;run;quit;
/*
Up to 40 obs "d:/wrk/datcut.sas7bdat" total obs=400,000
KEY ID OBS E1 E2 E3 B0 B1 B2 B3 B4 B5
1 2 2 1 2 3 600 0 200 0 530 550
2 2 2 1 2 3 600 0 200 0 530 550
3 2 2 1 2 3 600 0 200 0 530 550
4 2 2 1 2 3 600 0 200 0 530 550
5 2 2 1 2 3 600 0 200 0 530 550
6 2 2 1 2 3 600 0 200 0 530 550
7 2 2 1 2 3 600 0 200 0 530 550
8 2 2 1 2 3 600 0 200 0 530 550
*/
%macro getmode(set,fyl=d:/wrk/datcut.sas7bdat,span=40000);
/*
%let set=5;
%let fyl=d:/wrk/datcut.sas7bdat;
%let span=5000;
*/
%local b e;
%let e=%sysevalf(&set * &span );
%let b=%sysevalf((&set - 1)*&span +1);
%put Stary &=b;
%put END &=e;
* grab 50,000 subset;
data "d:/wrk/subset&set..sas7bdat";
set "&fyl"(firstobs=&b obs=&e);
run;quit;
%utl_submit_r(
library(haven);
library(SASxport);
datcut<-as.matrix(read_sas('d:/wrk/subset&set..sas7bdat'));
/* Begin R solution */
resamp <- function(x){if (length(x)==1) {x} else {sample(x,1)}};
rmode <- function(x) {
x <- sort(x);
u <- unique(x);
if (length(x) == length(u)) {resamp(x)} else {
y <- lapply(u, function(y) length(x[x==y]));
resamp(u[which( unlist(y) == max(unlist(y)))]);
}
};
mode_e<-apply(datcut[,c(4:6)],1,rmode);
mode_b<-apply(datcut[,c(7:12)],1,rmode);
keyres=as.data.frame(cbind(datcut,as.data.frame(cbind(mode_e,mode_b))));
write.xport(keyres,file='d:/xpt/xfr&set..xpt');
);
%mend getmode;
* if you tested with let statements inside macro (creates globals);
%symdel set b e fyl;
%getmode(3);
libname xpt xport 'd:/xpt/xfr3.xpt';
proc print data=xpt.keyres(obs=10);
;run;quit;
libname xpt clear;
%let _s=%sysfunc(compbl(C:\Progra~1\SASHome\SASFoundation\9.4\sas.exe -sysin c:\nul -sasautos c:\oto -autoexec c:\oto\Tut_Oto.sas
-work h:\wrk));
options noxwait noxsync;
%let tym=%sysfunc(time());
systask kill sys1 sys2 sys3 sys4 sys5 sys6 sys7 sys8;
systask command "&_s -termstmt %nrstr(%getmode(1);) -log G:\wrk\sys1.log" taskname=sys1;
systask command "&_s -termstmt %nrstr(%getmode(2);) -log G:\wrk\sys2.log" taskname=sys2;
systask command "&_s -termstmt %nrstr(%getmode(3);) -log G:\wrk\sys3.log" taskname=sys3;
systask command "&_s -termstmt %nrstr(%getmode(4);) -log G:\wrk\sys4.log" taskname=sys4;
systask command "&_s -termstmt %nrstr(%getmode(5);) -log G:\wrk\sys5.log" taskname=sys5;
systask command "&_s -termstmt %nrstr(%getmode(6);) -log G:\wrk\sys6.log" taskname=sys6;
systask command "&_s -termstmt %nrstr(%getmode(7);) -log G:\wrk\sys7.log" taskname=sys7;
systask command "&_s -termstmt %nrstr(%getmode(8);) -log G:\wrk\sys8.log" taskname=sys8;
waitfor sys1 sys2 sys3 sys4 sys5 sys6 sys7 sys8;
%put %sysevalf( %sysfunc(time()) - &tym);
/*
NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
NOTE: The SAS System used:
real time 4:15.10
Cumulative Cumulative
E1 E2 E3 Frequency Percent Frequency Percent
-------------------------------------------------------------------
1 2 . 80000 20.00 80000 20.00
1 2 3 200000 50.00 280000 70.00
2 2 1 40000 10.00 320000 80.00
2 2 3 80000 20.00 400000 100.00
*/
%macro vueapp(dummy);
%local i;
%do i=1 %to 8;
libname xpt&i xport "d:/xpt/xfr&i..xpt";
%end;
libname sd1 "d:/sd1";
proc datasets lib=sd1 mt=view;
delete keyresall;
;run;quit;
data sd1.keyresall/view=sd1.keyresall;
set
xpt1.keyres
xpt2.keyres
xpt3.keyres
xpt4.keyres
xpt5.keyres
xpt6.keyres
xpt7.keyres
xpt8.keyres
;
;run;quit;
proc freq data=sd1.keyresall;
tables e1*e2*e3 / list missing;
;run;quit;
%mend vueapp;
%vueapp;
libname xpt xport 'd:/xpt/xfr8.xpt';
proc print data=sd1.keyresall(where=(uniform(-1)<10/400000));
;run;quit;
libname xpt clear;
/*
KEY ID OBS E1 E2 E3 B0 B1 B2 B3 B4 B5 MODE_E MODE_B
12384 2 2 1 2 3 600 0 200 0 530 550 2 0
50429 2 1 1 2 3 600 0 200 0 530 550 2 0
54631 2 1 1 2 3 600 0 200 0 530 550 1 0
77263 2 1 1 2 3 600 0 200 0 530 550 2 0
77572 2 1 1 2 3 600 0 200 0 530 550 1 0
130353 3 2 2 2 3 150 500 . 120 3 58 2 3
233744 4 3 1 2 3 200 300 550 270 0 50 2 550
319192 4 1 1 2 3 200 300 550 270 0 50 1 200
*/
Computing the mode by row (combination of R and R)
HAVE
Need the mode for variables e1-e3 and b0-b5 in a new column (mode_e, mode_b).
If multimode, select one randomly.
If unique select select one randomly.
Keep all the variables in the final output.
The code must be able to compute this for large database.
The R code below can process about 100,000 records a minute
on my 2008 T7400(64gb ram) running 8 jobs in parallel.
10 minutes to do 1 million, 1000 minutes to 100 millon (still small scale).
Too long to get to big data (ie 10 billion +).
10 billion would take 100,000 minutes. (100000/1440) = 69 days.
SAS should be faster, but not as elgant(maybe IML)?.
The solutiion below will not scale to big data (ie 10 billion+ rows)
Maybe it will work work with millions of rows
* added a key - I like I primary key;
data "d:/sd1/df_a.sas7bdat" ;
retain id obs key 0;
input id obs e1 e2 e3 b0 b1 b2 b3 b4 b5;
key=_n_;
datalines;
2 2 1 2 3 600 0 200 0 530 550
2 1 1 2 3 600 0 200 0 530 550
3 3 2 2 3 150 500 . 120 3 58
3 2 2 2 3 150 500 . 120 3 58
3 1 2 2 1 150 500 . 120 3 58
4 3 1 2 3 200 300 550 270 0 50
4 2 1 2 3 200 300 550 270 0 50
4 1 1 2 3 200 300 550 270 0 50
5 1 1 2 . 20 20 200 200 500 57
5 2 1 2 . 20 20 200 200 500 57
5 3 1 2 . 20 20 200 200 500 57
run;quit;
HAVE
/*
Up to 40 obs WC000001.DF_A total obs=11
ID OBS KEY E1 E2 E3 B0 B1 B2 B3 B4 B5
2 2 1 1 2 3 600 0 200 0 530 550
2 1 2 1 2 3 600 0 200 0 530 550
3 3 3 2 2 3 150 500 . 120 3 58
3 2 4 2 2 3 150 500 . 120 3 58
3 1 5 2 2 1 150 500 . 120 3 58
4 3 6 1 2 3 200 300 550 270 0 50
4 2 7 1 2 3 200 300 550 270 0 50
4 1 8 1 2 3 200 300 550 270 0 50
5 1 9 1 2 . 20 20 200 100 500 57
5 2 10 1 2 . 20 20 200 100 500 57
5 3 11 1 2 . 20 20 200 100 500 57
WANT
Up to 40 obs from xpt.keyres total obs=&tob
ID OBS KEY E1 E2 E3 B0 B1 B2 B3 B4 B5 MODE_E MODE_B
2 2 1 1 2 3 600 0 200 0 530 550 1 0 single mode
2 1 2 1 2 3 600 0 200 0 530 550 2 0 single mode
3 3 3 2 2 3 150 500 . 120 3 58 3 3 random
3 2 4 2 2 3 150 500 . 120 3 58 2 500 random
3 1 5 2 2 1 150 500 . 120 3 58 2 3 random
4 3 6 1 2 3 200 300 550 270 0 50 6 300 random
4 2 7 1 2 3 200 300 550 270 0 50 7 270 random
4 1 8 1 2 3 200 300 550 270 0 50 2 270 random
5 1 9 1 2 . 20 20 200 200 500 57 2 20 multimode(random)
5 2 10 1 2 . 20 20 200 200 500 57 2 200 multimode(random)
5 3 11 1 2 . 20 20 200 200 500 57 2 20 multimode(random)
*/
SOLUTION
/*
Using list lapply find max count of duplicates
Get the values corresponding to max counts
Sample random one of multiple modes
Sample random one if no mode (all values distinct)
Sample has an issue when there is only one valee so I use resamp
*/
%utl_submit_r(
library(haven);
library(SASxport);
datcut<-as.matrix(read_sas('d:/sd1/df_a.sas7bdat'));
/* Begin R solution */
resamp <- function(x){if (length(x)==1) {x} else {sample(x,1)}};
rmode <- function(x) {
x <- sort(x);
u <- unique(x);
if (length(x) == length(u)) {resamp(x)} else {
y <- lapply(u, function(y) length(x[x==y]));
resamp(u[which( unlist(y) == max(unlist(y)))]);
}
};
mode_e<-apply(datcut[,c(4:6)],1,rmode);
mode_b<-apply(datcut[,c(7:11)],1,rmode);
keyres=as.data.frame(cbind(datcut,as.data.frame(cbind(mode_e,mode_b))));
keyres;
write.xport(keyres,file='d:/xpt/xfr&set..xpt');
));
/* R LOG
> library(haven); library(SASxport); datcut<-as.matrix(read_sas('d:/sd1/df_a.sas7bdat'));
resamp <- function(x){if (length(x)==1) {x} else {sample(x,1)}}; rmode <- functi
on(x) { x <- sort(x); u <- unique(x); y <- lapply(u, function(y) length(x[x==y]));
u[which( unlist(y) == max(unlist(y)))]; }; mode_e<-unlist(lapply(apply(datcut[,c
(4:6)],1,rmode),resamp)); mode_b<-unlist(l
ID OBS KEY E1 E2 E3 B0 B1 B2 B3 B4 B5 mode_e mode_b
1 2 2 1 1 2 3 600 0 200 0 530 550 2 0
2 2 1 2 1 2 3 600 0 200 0 530 550 1 0
3 3 3 3 2 2 3 150 500 NA 120 3 58 2 150 Random
4 3 2 4 2 2 3 150 500 NA 120 3 58 2 120 Random
5 3 1 5 2 2 1 150 500 NA 120 3 58 2 500 Random
6 4 3 6 1 2 3 200 300 550 270 0 50 2 0 Random
7 4 2 7 1 2 3 200 300 550 270 0 50 1 270 Random
8 4 1 8 1 2 3 200 300 550 270 0 50 1 270 Random
9 5 1 9 1 2 NA 20 20 200 200 500 57 1 200 multimode
10 5 2 10 1 2 NA 20 20 200 200 500 57 2 20 multimode
11 5 3 11 1 2 NA 20 20 200 200 500 57 2 200 multimode
*/
libname xpt xport 'd:/xpt/xfr.xpt';
proc print data=xpt.keyres;
;run;quit;
libname xpt clear;
/*
SAS Proc print
ID OBS KEY E1 E2 E3 B0 B1 B2 B3 B4 B5 MODE_E MODE_B
2 2 1 1 2 3 600 0 200 0 530 550 2 0
2 1 2 1 2 3 600 0 200 0 530 550 1 0
3 3 3 2 2 3 150 500 . 120 3 58 2 150
3 2 4 2 2 3 150 500 . 120 3 58 2 120
3 1 5 2 2 1 150 500 . 120 3 58 2 500
4 3 6 1 2 3 200 300 550 270 0 50 2 0
4 2 7 1 2 3 200 300 550 270 0 50 1 270
4 1 8 1 2 3 200 300 550 270 0 50 1 270
5 1 9 1 2 . 20 20 200 100 500 57 1 200
5 2 10 1 2 . 20 20 200 100 500 57 2 20
5 3 11 1 2 . 20 20 200 100 500 57 2 200
*/
SCALE THE PROBLEM
* This solution only scales well for small data, in the millions;
* Create a very small dataset of 400,000 obs;
* We will run 8 parallel jobs so lets
split the 400,000 observations
into 8 pieces of 50,000 obs each;
* Create a small datasets about 400,000 obs;
* 400,000;
data "d:/wrk/datcut.sas7bdat";
retain key 0;
set "d:/sd1/df_a.sas7bdat"(drop=key obs=10);
do i=1 to 40000;
key=key+1;
output;
end;
drop i;
;run;quit;
/*
Up to 40 obs "d:/wrk/datcut.sas7bdat" total obs=400,000
KEY ID OBS E1 E2 E3 B0 B1 B2 B3 B4 B5
1 2 2 1 2 3 600 0 200 0 530 550
2 2 2 1 2 3 600 0 200 0 530 550
3 2 2 1 2 3 600 0 200 0 530 550
4 2 2 1 2 3 600 0 200 0 530 550
5 2 2 1 2 3 600 0 200 0 530 550
6 2 2 1 2 3 600 0 200 0 530 550
7 2 2 1 2 3 600 0 200 0 530 550
8 2 2 1 2 3 600 0 200 0 530 550
*/
%macro getmode(set,fyl=d:/wrk/datcut.sas7bdat,span=40000);
/*
%let set=5;
%let fyl=d:/wrk/datcut.sas7bdat;
%let span=5000;
*/
%local b e;
%let e=%sysevalf(&set * &span );
%let b=%sysevalf((&set - 1)*&span +1);
%put Stary &=b;
%put END &=e;
* grab 50,000 subset;
data "d:/wrk/subset&set..sas7bdat";
set "&fyl"(firstobs=&b obs=&e);
run;quit;
%utl_submit_r(
library(haven);
library(SASxport);
datcut<-as.matrix(read_sas('d:/wrk/subset&set..sas7bdat'));
/* Begin R solution */
resamp <- function(x){if (length(x)==1) {x} else {sample(x,1)}};
rmode <- function(x) {
x <- sort(x);
u <- unique(x);
if (length(x) == length(u)) {resamp(x)} else {
y <- lapply(u, function(y) length(x[x==y]));
resamp(u[which( unlist(y) == max(unlist(y)))]);
}
};
mode_e<-apply(datcut[,c(4:6)],1,rmode);
mode_b<-apply(datcut[,c(7:12)],1,rmode);
keyres=as.data.frame(cbind(datcut,as.data.frame(cbind(mode_e,mode_b))));
write.xport(keyres,file='d:/xpt/xfr&set..xpt');
);
%mend getmode;
* if you tested with let statements inside macro (creates globals);
%symdel set b e fyl;
%getmode(3);
libname xpt xport 'd:/xpt/xfr3.xpt';
proc print data=xpt.keyres(obs=10);
;run;quit;
libname xpt clear;
%let _s=%sysfunc(compbl(C:\Progra~1\SASHome\SASFoundation\9.4\sas.exe -sysin c:\nul -sasautos c:\oto -autoexec c:\oto\Tut_Oto.sas
-work h:\wrk));
options noxwait noxsync;
%let tym=%sysfunc(time());
systask kill sys1 sys2 sys3 sys4 sys5 sys6 sys7 sys8;
systask command "&_s -termstmt %nrstr(%getmode(1);) -log G:\wrk\sys1.log" taskname=sys1;
systask command "&_s -termstmt %nrstr(%getmode(2);) -log G:\wrk\sys2.log" taskname=sys2;
systask command "&_s -termstmt %nrstr(%getmode(3);) -log G:\wrk\sys3.log" taskname=sys3;
systask command "&_s -termstmt %nrstr(%getmode(4);) -log G:\wrk\sys4.log" taskname=sys4;
systask command "&_s -termstmt %nrstr(%getmode(5);) -log G:\wrk\sys5.log" taskname=sys5;
systask command "&_s -termstmt %nrstr(%getmode(6);) -log G:\wrk\sys6.log" taskname=sys6;
systask command "&_s -termstmt %nrstr(%getmode(7);) -log G:\wrk\sys7.log" taskname=sys7;
systask command "&_s -termstmt %nrstr(%getmode(8);) -log G:\wrk\sys8.log" taskname=sys8;
waitfor sys1 sys2 sys3 sys4 sys5 sys6 sys7 sys8;
%put %sysevalf( %sysfunc(time()) - &tym);
/*
NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
NOTE: The SAS System used:
real time 4:15.10
Cumulative Cumulative
E1 E2 E3 Frequency Percent Frequency Percent
-------------------------------------------------------------------
1 2 . 80000 20.00 80000 20.00
1 2 3 200000 50.00 280000 70.00
2 2 1 40000 10.00 320000 80.00
2 2 3 80000 20.00 400000 100.00
*/
%macro vueapp(dummy);
%local i;
%do i=1 %to 8;
libname xpt&i xport "d:/xpt/xfr&i..xpt";
%end;
libname sd1 "d:/sd1";
proc datasets lib=sd1 mt=view;
delete keyresall;
;run;quit;
data sd1.keyresall/view=sd1.keyresall;
set
xpt1.keyres
xpt2.keyres
xpt3.keyres
xpt4.keyres
xpt5.keyres
xpt6.keyres
xpt7.keyres
xpt8.keyres
;
;run;quit;
proc freq data=sd1.keyresall;
tables e1*e2*e3 / list missing;
;run;quit;
%mend vueapp;
%vueapp;
libname xpt xport 'd:/xpt/xfr8.xpt';
proc print data=sd1.keyresall(where=(uniform(-1)<10/400000));
;run;quit;
libname xpt clear;
/*
KEY ID OBS E1 E2 E3 B0 B1 B2 B3 B4 B5 MODE_E MODE_B
12384 2 2 1 2 3 600 0 200 0 530 550 2 0
50429 2 1 1 2 3 600 0 200 0 530 550 2 0
54631 2 1 1 2 3 600 0 200 0 530 550 1 0
77263 2 1 1 2 3 600 0 200 0 530 550 2 0
77572 2 1 1 2 3 600 0 200 0 530 550 1 0
130353 3 2 2 2 3 150 500 . 120 3 58 2 3
233744 4 3 1 2 3 200 300 550 270 0 50 2 550
319192 4 1 1 2 3 200 300 550 270 0 50 1 200
*/
You have two category of variables, so you need run the code two times.
data have;
input id obs e1 e2 e3 b0 b1 b2 b3 b4 b5;
datalines;
2 2 1 2 3 600 0 200 0 530 550
2 1 1 2 3 600 0 200 0 530 550
3 3 2 2 3 150 500 . 120 3 58
3 2 2 2 3 150 500 . 120 3 58
3 1 2 2 1 150 500 . 120 3 58
4 3 1 2 3 200 300 550 270 0 50
4 2 1 2 3 200 300 550 270 0 50
4 1 1 2 3 200 300 550 270 0 50
5 1 1 2 . 20 20 200 100 500 57
5 2 1 2 . 20 20 200 100 500 57
5 3 1 2 . 20 20 200 100 500 57
;
run;
data want(drop=k count i max ran);
if _n_ eq 1 then do;
declare hash ha();
declare hiter hi('ha');
ha.definekey('k');
ha.definedata('k','count');
ha.definedone();
declare hash random(ordered:'a');
declare hiter hir('random');
random.definekey('ran');
random.definedata('k','count');
random.definedone();
end;
set have;
array x{*} e: ;
do i=1 to dim(x);
if not missing(x{i}) then do;
k=x{i};
if ha.find()=0 then do;count+1;ha.replace();end;
else do;count=1;ha.add();end;
end;
end;
do while(hi.next()=0);
ran=ranuni(-1);random.add();
end;
max=0;
do while(hir.next()=0);
if count gt max then do;max=count;e_mod=k;end;
end;
ha.clear(); random.clear();
array y{*} b: ;
do i=1 to dim(y);
if not missing(y{i}) then do;
k=y{i};
if ha.find()=0 then do;count+1;ha.replace();end;
else do;count=1;ha.add();end;
end;
end;
do while(hi.next()=0);
ran=ranuni(-1);random.add();
end;
max=0;
do while(hir.next()=0);
if count gt max then do;max=count;b_mod=k;end;
end;
ha.clear(); random.clear();
run;
Hi Team
There was atype in my previous heading, should have been R and SAS not R and R.
I ran Xia's code and it took about 6 seconds for 400,000 records. Much faster than R.
Mine took 4 minutes, running in parallel.
The output was the same as mine.
The FREQ Procedure
Cumulative Cumulative
E1 E2 E3 Frequency Percent Frequency Percent
-------------------------------------------------------------------
1 2 . 80000 20.00 80000 20.00
1 2 3 200000 50.00 280000 70.00
2 2 1 40000 10.00 320000 80.00
2 2 3 80000 20.00 400000 100.00
Not sure why I am getting double posts
I will try to explain Xias astonishing fast
mode algoritm
Lets explain how we arrive at B_MOD
O B0 B1 B2 B3 B4 B5 B_MOD
1 20 20 200 200 300 300 20
2 20 20 200 200 300 300 200
3 20 20 200 200 300 300 300
OBSERVATION 1 B_MOD=20
* first we examine each of the Bs against
all other Bs, but add only when we exceed a
previous count.
b iter COUNT=1 K=20 20 20 200 200 300 300
b iter COUNT=2 K=20 20 20 200 200 300 300
b iter COUNT=1 K=200 20 20 200 200 300 300
b iter COUNT=2 K=200 20 20 200 200 300 300
b iter COUNT=1 K=300 20 20 200 200 300 300
b iter COUNT=2 K=300 20 20 200 200 300 300
* final analysis we yhe max count of 2 for three pairs;
final b COUNT=2 K=300 20 20 200 200 300 300
* nest we generate a random sequence of uniform random numbers;
ran iter b RAN=0.9850622509 K=200 COUNT=2
ran iter b RAN=0.8866575662 K=300 COUNT=2
ran iter b RAN=0.2885982167 K=20 COUNT=2
iter max count K=20 MAX=2 B_MOD=20
iter max count K=300 MAX=2 B_MOD=20
iter max count K=200 MAX=2 B_MOD=20
find max count K=200 MAX=2 B_MOD=20
* they all have max=2 so we take
the one with the lowest random number
Remember ran is ordered
RAN=0.2885982167 K=20
ie B_MOD=20
OBSERVATION 2 B_MOD=200
Same as above
b iter COUNT=1 K=20 20 20 200 200 300 300
b iter COUNT=2 K=20 20 20 200 200 300 300
b iter COUNT=1 K=200 20 20 200 200 300 300
b iter COUNT=2 K=200 20 20 200 200 300 300
b iter COUNT=1 K=300 20 20 200 200 300 300
b iter COUNT=2 K=300 20 20 200 200 300 300
final b COUNT=2 K=300 20 20 200 200 300 300
ran iter b RAN=0.0071778353 K=200 COUNT=2
ran iter b RAN=0.5806072809 K=300 COUNT=2
ran iter b RAN=0.9682415142 K=20 COUNT=2
iter max count K=200 MAX=2 B_MOD=200
iter max count K=300 MAX=2 B_MOD=200
iter max count K=20 MAX=2 B_MOD=200
find max count K=20 MAX=2 B_MOD=200
* they all have max=2 so we take
the one with the lowest random number
Remember ran is ordered
RAN=0.0071778353 K=200
ie B_MOD=200
b iter COUNT=1 K=20 20 20 200 200 300 300
b iter COUNT=2 K=20 20 20 200 200 300 300
b iter COUNT=1 K=200 20 20 200 200 300 300
b iter COUNT=2 K=200 20 20 200 200 300 300
b iter COUNT=1 K=300 20 20 200 200 300 300
b iter COUNT=2 K=300 20 20 200 200 300 300
final b COUNT=2 K=300 20 20 200 200 300 300
ran iter b RAN=0.8807552 K=200 COUNT=2
ran iter b RAN=0.2498995183 K=300 COUNT=2
ran iter b RAN=0.7680367258 K=20 COUNT=2
iter max count K=300 MAX=2 B_MOD=300
iter max count K=20 MAX=2 B_MOD=300
iter max count K=200 MAX=2 B_MOD=300
find max count K=200 MAX=2 B_MOD=300
* they all have max=2 so we take
the one with the lowest random number
Remember ran is ordered
RAN=0.2498995183 K=300
ie B_MOD=300
%macro repmode(dummy);
%let cmt=%str(*);
%do i=1 %to 3;
data want(drop=k count i max ran);
/* setup */
if _n_ eq 1 then do;
declare hash ha ();
declare hiter hi('ha');
ha.definekey ('k');
ha.definedata ('k','count');
ha.definedone ();
declare hash random(ordered:'a');
declare hiter hir ('random');
random.definekey ('ran');
random.definedata ('k','count');
random.definedone ();
end;
set have(obs=1);
array x{*} e: ;
do i=1 to dim(x);
if not missing(x{i}) then do;
k=x{i};
if ha.find()=0 then do;
count+1;
ha.replace();
end;
else do;
count=1;
ha.add();
end;
end;
&cmt put 'e iter ' count= e1= e2= e3=;
end;
&cmt put 'final e ' count= e1= e2= e3=;
/*
As it iterates through the e array it
checks for for duplicates and
updates the counter count
Input
E1 E2 E3
2 2 2
e iter COUNT=1 E1=2 E2=2 E3=2
e iter COUNT=2 E1=2 E2=2 E3=2
e iter COUNT=3 E1=2 E2=2 E3=2
final e COUNT=3 E1=2 E2=2 E3=2
*/
do while(hi.next()=0);
ran=ranuni(-1);
random.add();
&cmt put 'ran ' ran=;
end;
/*
ran RAN=0.5100509643
*/
max=0;
do while(hir.next()=0);
if count gt max then do;
max=count;
e_mod=k;
end;
end;
&cmt put 'find max count ' max= e_mod=;
/*
we have the value and the max;
find max count MAX=3 E_MOD=2
*/
ha.clear();
random.clear();
* mobe onto the b array;
array y{*} b: ;
do i=1 to dim(y);
if not missing(y{i}) then do;
k=y{i};
if ha.find()=0 then do;
count+1;
ha.replace();
end;
else do;
count=1;
ha.add();
end;
end;
put 'b iter ' count= k= b0 b1 b2 b3 b4 b5;
end;
put 'final b ' count= k= b0 b1 b2 b3 b4 b5;
/*
INPUT
B0 B1 B2 B3 B4 B5
20 20 200 200 500 57
b iter COUNT=1 20 20 200 200 500 57
b iter COUNT=2 20 20 200 200 500 57 The two 20s
b iter COUNT=1 20 20 200 200 500 57
b iter COUNT=2 20 20 200 200 500 57 The two 200s
b iter COUNT=1 20 20 200 200 500 57
b iter COUNT=1 20 20 200 200 500 57
final b COUNT=1 20 20 200 200 500 57
*/
do while(hi.next()=0);
ran=ranuni(-1);
random.add();
put 'ran iter b ' ran= k= count=;
end;
put 'ran fin b ' ran= k= count=;
/*
ran iter b RAN=0.5936162847
ran iter b RAN=0.5504719776
ran iter b RAN=0.1298635263
ran iter b RAN=0.2945035129
ran fin b RAN=0.2945035129
*/
max=0;
do while(hir.next()=0);
if count gt max then do;
max=count;
b_mod=k;
end;
put 'iter max count ' k= max= b_mod= ran=;
end;
put 'find max count ' k= max= b_mod= rsn=;
ha.clear();
random.clear();
run;
proc append data=want base=wants;
;run;quit;
%end;
%mend repmode;
proc datasets lib=work;
delete wants;
;run;quit;
%utlnopts;
%repmode;
Roger,
I want you test the following IML code if it would be faster than data step.
data have;
input id obs e1 e2 e3 b0 b1 b2 b3 b4 b5;
datalines;
2 2 1 2 3 600 0 200 0 530 550
2 1 1 2 3 600 0 200 0 530 550
3 3 2 2 3 150 500 . 120 3 58
3 2 2 2 3 150 500 . 120 3 58
3 1 2 2 1 150 500 . 120 3 58
4 3 1 2 3 200 300 550 270 0 50
4 2 1 2 3 200 300 550 270 0 50
4 1 1 2 3 200 300 550 270 0 50
5 1 1 2 . 20 20 200 100 500 57
5 2 1 2 . 20 20 200 100 500 57
5 3 1 2 . 20 20 200 100 500 57
;
run;
proc iml;
use have nobs nobs;
read all var {e1 e2 e3} into x;
read all var {b0 b1 b2 b3 b4 b5} into y;
close;
mode_e=j(nobs,1,.);
do i=1 to nobs;
temp=x[i,];
call tabulate(levels,freq,temp);
max_idx=loc(freq=freq[<>]);
want_idx=sample(max_idx,1);
mode_e[i]=levels[want_idx];
end;
mode_b=j(nobs,1,.);
do i=1 to nobs;
temp=y[i,];
call tabulate(levels,freq,temp);
max_idx=loc(freq=freq[<>]);
want_idx=sample(max_idx,1);
mode_b[i]=levels[want_idx];
end;
create mode var {mode_e mode_b};
append;
close;
quit;
data want;
merge have mode;
run;
proc print noobs;run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.