So I have a huge dataset (millions of rows) with over 100 variables (and more get added regularly), each variable holding a two-digit value. For example:
C1 C2 C3 C4 C5 .. ..
10 12 15 55 15 .. ..
10 10 15 10 12 .. ..
10 10 10 10 10 .. ..
10 10 12 12 15 .. ..
.. .. .. .. .. .. ..
.. .. .. .. .. .. ..
Now I want to create and populate a result variable that reads each row and outputs the value with the most occurrences. For example:
C1 C2 C3 C4 C5 .. .. RESULT
10 12 15 55 15 .. .. 15
10 10 15 10 12 .. .. 10
10 10 10 10 10 .. .. 10
10 10 12 12 15 .. .. 12 (higher value in case of a tie)
.. .. .. .. .. .. ..
.. .. .. .. .. .. ..
What would be the most efficient code for this, allowing for the possibility of more test columns (with values) being added to the dataset?
Thanks in advance !!!
First step: transpose from a wide to a long format:
10 12 15 55 15
to
10
12
15
55
15
Then sort (within the groups that identify the original rows):
10
12
15
15
55
Now you can get a count with (e.g.) proc summary and retrieve the highest count for each group.
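A minimal sketch of that pipeline, assuming an id variable identifies the original rows and the value columns share the C prefix (both names are assumptions, since the thread does not show the real ones):
/* Sketch only: id and the C prefix are assumed names */
proc transpose data=have out=long (rename=(col1=value));
    by id;
    var C:;
run;

/* one row per (id, value); _FREQ_ holds the count per value */
proc summary data=long nway;
    by id;
    class value;
    output out=counts (drop=_type_ rename=(_freq_=count));
run;

/* highest count first; descending value breaks ties upward */
proc sort data=counts;
    by id descending count descending value;
run;

data want;
    set counts;
    by id;
    if first.id;
run;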
Hi Kurt ..
Transpose is out of the question, as there are millions of rows (which would be turned into columns in a transpose) .. I could go on a nice long vacation while the program runs 🙂 .. can't transpose .. I tried it with a small subset of the data (20,000 rows) and got tired of twiddling my thumbs :))
Hello,
data want;
    set have;
    array C(*) C:;             /* every variable whose name starts with C */
    array a(100) a100-a1;      /* counters for the possible values 00-99, laid out in descending value order */
    drop i a:;
    call missing(of a(*));     /* reset the counters for each observation */
    do i=1 to dim(C);
        a(100-C(i))+1;         /* value v is tallied at position 100-v */
    end;
    /* whichn returns the FIRST position of the max; with the descending
       layout that is the highest value, so ties resolve upward.
       idx holds the most frequent value (the RESULT column) */
    idx=100-whichn(max(of a(*)), of a(*));
run;
Hi Gamotte,
Thanks so much for this awesome solution ..
Can you please explain the significance of '100' and the 'call missing'?
Thanks
@jfaruqui wrote:
Hi Gamotte,
Thanks so much for this awesome solution ..
Can you please explain the significance of '100' and the 'call missing'?
Thanks
call missing() reinitializes the counter array at the top of each observation. That's necessary because the counters are incremented with sum statements (a(...)+1), which are automatically retained across observations.
One thing to note: the 100 in @gamotte's program is the number of possible values (two digits, so 00-99), not the number of variables; the C array sizes itself with dim(), so newly added columns are picked up automatically. If you ever do need the variable count (for a sanity check, say), you can get it with a preliminary step:
data _null_;
set have;
array vars{*} C:;
call symput('nvars',put(dim(vars),best.));
stop;
run;
so you can use &nvars in later steps instead of hard-coding a count.
call missing is used to reinitialize the array for each observation.
The idea is to create an array of size 100 that stores a counter for each possible value of the C<i> variables. Since you said they have two digits, the possible values are 00 to 99. When C<i>=10, for instance, the counter at position 100-10=90 is incremented.
Since you added the condition that, in case of a tie, the maximum should be retained, and because the whichn function returns the first match, I reversed the order of the columns: a100 to a1 instead of a1 to a100.
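To make the reversal concrete, here is a toy data step (made-up counts, not from the thread) showing how whichn picks the lower value with an ascending layout and the higher value with a descending one:
data _null_;
    /* counts for the values 1-5 listed ascending: values 2 and 4 tie at 3 */
    array up(5) _temporary_ (1 3 2 3 0);
    /* the same counts listed with values descending (5,4,3,2,1) */
    array dn(5) _temporary_ (0 3 2 3 1);
    value_up = whichn(max(of up(*)), of up(*));      /* position 2 -> value 2 (lower tie value) */
    value_dn = 6 - whichn(max(of dn(*)), of dn(*));  /* position 2 -> value 4 (higher tie value) */
    put value_up= value_dn=;
run;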
A transpose of 20,000 recs is lightning fast. If you actually found the time to twiddle your thumbs, you should replace your 20-year-old computer with something more modern:
%macro make_have(no_vars,no_recs);
data have;
do id = 1 to &no_recs;
%do i = 1 %to &no_vars;
v_&i = int(rand('uniform')*55);
%end;
output;
end;
run;
%mend;
%make_have(200,20000)
proc transpose data=have out=have1 (rename=(col1=v));
by id;
var v_:;
run;
Log:
24   %macro make_have(no_vars,no_recs);
25   data have;
26   do id = 1 to &no_recs;
27   %do i = 1 %to &no_vars;
28   v_&i = int(rand('uniform')*55);
29   %end;
30   output;
31   end;
32   run;
33   %mend;
34   %make_have(200,20000)

NOTE: The data set WORK.HAVE has 20000 observations and 201 variables.
NOTE: DATA statement used (Total process time):
      real time           1.15 seconds
      cpu time            0.18 seconds

35
36   proc transpose data=have out=have1 (rename=(col1=v));
37   by id;
38   var v_:;
39   run;

NOTE: There were 20000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.HAVE1 has 4000000 observations and 3 variables.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
      real time           2.10 seconds
      cpu time            0.34 seconds
Note that's from a 2-core server, something considered very low in computing power for SAS.
Or you made an error in writing your transpose.
My suggested solution looks like this:
%macro make_have(no_vars,no_recs);
data have;
do id = 1 to &no_recs;
%do i = 1 %to &no_vars;
c&i = int(rand('uniform')*55);
%end;
output;
end;
run;
%mend;
%make_have(200,1000000)
/* wide to long: one observation per (id, value) pair */
proc transpose data=have out=have1 (drop=_name_ rename=(col1=c));
by id;
var c:;
run;

/* order the values within each id */
proc sort data=have1;
by id c;
run;

/* count the occurrences of each value per id */
data count;
set have1;
by id c;
if first.c
then count = 1;
else count + 1;
if last.c then output;
run;

/* highest count first; descending c breaks ties in favor of the higher value */
proc sort data=count;
by id descending count descending c;
run;

/* keep the winning value per id */
data want;
set count;
by id;
if first.id;
run;
The whole test for 200 variables and 1 million records produced this log:
24   %macro make_have(no_vars,no_recs);
25   data have;
26   do id = 1 to &no_recs;
27   %do i = 1 %to &no_vars;
28   c&i = int(rand('uniform')*55);
29   %end;
30   output;
31   end;
32   run;
33   %mend;
34   %make_have(200,1000000)

NOTE: The data set WORK.HAVE has 1000000 observations and 201 variables.
NOTE: DATA statement used (Total process time):
      real time           19.98 seconds
      cpu time            10.29 seconds

35
36   proc transpose data=have out=have1 (drop=_name_ rename=(col1=c));
37   by id;
38   var c:;
39   run;

NOTE: There were 1000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.HAVE1 has 200000000 observations and 2 variables.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
      real time           31.36 seconds
      cpu time            14.90 seconds

40
41   proc sort data=have1;
42   by id c;
43   run;

NOTE: There were 200000000 observations read from the data set WORK.HAVE1.
NOTE: The data set WORK.HAVE1 has 200000000 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           1:28.38
      cpu time            51.98 seconds

44
45   data count;
46   set have1;
47   by id c;
48   if first.c
49   then count = 1;
50   else count + 1;
51   if last.c then output;
52   run;

NOTE: There were 200000000 observations read from the data set WORK.HAVE1.
NOTE: The data set WORK.COUNT has 53600561 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           1:09.82
      cpu time            26.76 seconds

53
54   proc sort data=count;
55   by id descending count descending c;
56   run;

NOTE: There were 53600561 observations read from the data set WORK.COUNT.
NOTE: The data set WORK.COUNT has 53600561 observations and 3 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           35.41 seconds
      cpu time            16.39 seconds

57
58   data want;
59   set count;
60   by id;
61   if first.id;
62   run;

NOTE: There were 53600561 observations read from the data set WORK.COUNT.
NOTE: The data set WORK.WANT has 1000000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           14.40 seconds
      cpu time            4.04 seconds
At 31 seconds, you can see that the transpose will not pose a problem.
Alternatively, one can use a proc summary after the transpose:
24   proc summary data=have1 n nway;
25   by id;
26   class c;
27   var c;
28   output out=count (drop=_type_ c rename=(_freq_=count n_c=c)) n=n_c;
29   run;

NOTE: There were 200000000 observations read from the data set WORK.HAVE1.
NOTE: The data set WORK.COUNT has 53600561 observations and 3 variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           1:27.70
      cpu time            48.06 seconds
So you get a total runtime of roughly 4 minutes for 1 million records with 200 variables.
Note that the program (especially the data step/sort version) makes no assumptions about
- number of variables
- number of possible values
- variable type
and is therefore completely data-driven.
@jfaruqui wrote:
So I have a huge dataset (millions of rows) with over 100 variables (and more get added regularly) ..
What would be the most efficient code for this, allowing for the possibility of more test columns (with values) being added to the dataset?
Almost any process that continually adds variables is poorly designed. Having to change code because of constantly added variables that are essentially the same as the existing ones means that the advantage of computers doing the routine and tedious work is partially negated.
Typically this approach comes from a spreadsheet-oriented solution that is then "automated with manual changes" to account for the poor data structure. There is a large body of work on normalizing data for automated processing that may be of interest.
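For illustration only (the id, test, and value names here are made up), a normalized long layout stores one row per measurement, so adding another test adds rows instead of columns and the counting code never changes:
/* hypothetical long layout: one row per (id, test) measurement */
data have_long;
    input id test $ value;
    cards;
1 C1 10
1 C2 12
1 C3 15
1 C4 55
1 C5 15
;
run;

/* frequencies per id, independent of how many tests exist */
proc freq data=have_long noprint;
    by id;
    tables value / out=counts (drop=percent);
run;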
How about this one?
data have;
input C1 C2 C3 C4 C5;
cards;
10 12 15 55 15
10 10 15 10 12
10 10 10 10 10
10 10 12 12 15
;
run;
data want;
    if _n_=1 then do;
        /* hash object keeps one entry per distinct value, ordered ascending */
        declare hash h(ordered:'a');
        declare hiter hi('h');
        h.definekey('k');
        h.definedata('k','count');
        h.definedone();
    end;
    set have;
    array x{*} c1-c5;     /* extend this list when more columns are added */
    do i=1 to dim(x);
        k=x{i};
        rc=h.find();
        if rc=0 then do; count=count+1; h.replace(); end;  /* value seen before: bump its count */
        else do; count=1; h.add(); end;                    /* first occurrence of this value */
    end;
    /* iterate ascending; >= keeps the later (higher) value when counts tie */
    do while(hi.next()=0);
        if count>=max then do; max=count; RESULT=k; end;
    end;
    h.clear();            /* empty the hash before the next row */
    drop i k rc max count;
run;
P.S. If you can use SAS/IML, that could speed this up a lot and save you a lot of code.
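For what it's worth, a rough IML sketch of that idea might look like this (an untested outline; it assumes the C columns are the only numeric variables in HAVE):
proc iml;
use have;
read all var _NUM_ into X;               /* assumption: only the C columns are numeric */
close have;

result = j(nrow(X), 1, .);
do i = 1 to nrow(X);
    u = unique(X[i, ]);                  /* distinct values in this row, ascending */
    counts = j(1, ncol(u), 0);
    do j = 1 to ncol(u);
        counts[j] = sum(X[i, ] = u[j]);  /* occurrences of each distinct value */
    end;
    ties = loc(counts = max(counts));
    result[i] = max(u[ties]);            /* higher value wins ties */
end;

create want from result[colname='RESULT'];
append from result;
close want;
quit;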