So I have a huge dataset (millions of rows) with over 100 variables (and more get added regularly), each variable holding a two-digit value. For example:
C1 C2 C3 C4 C5 .. ..
10 12 15 55 15 .. ..
10 10 15 10 12 .. ..
10 10 10 10 10 .. ..
10 10 12 12 15 .. ..
.. .. .. .. .. .. ..
.. .. .. .. .. .. ..
Now I want to create and populate a result variable that reads each row and outputs the value with the most occurrences. For example:
C1 C2 C3 C4 C5 .. .. RESULT
10 12 15 55 15 .. .. 15
10 10 15 10 12 .. .. 10
10 10 10 10 10 .. .. 10
10 10 12 12 15 .. .. 12 (higher value in case of a tie)
.. .. .. .. .. .. ..
.. .. .. .. .. .. ..
What would be the most efficient code for this, allowing for the possibility of more test columns (with values) being added to the dataset?
Thanks in advance !!!
First step: transpose from a wide to a long format:
10 12 15 55 15
to
10
12
15
55
15
Then sort (within the groups that identify the original rows):
10
12
15
15
55
Now you can get a count with (e.g.) proc summary and retrieve the highest count for each group.
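A minimal sketch of that pipeline, assuming an id variable identifies the original rows and the value columns share the C prefix (both names are assumptions, since the thread does not show the real ones):
/* Sketch only: id and the C prefix are assumed names */
proc transpose data=have out=long (rename=(col1=value));
    by id;
    var C:;
run;

/* one row per (id, value); _FREQ_ holds the count per value */
proc summary data=long nway;
    by id;
    class value;
    output out=counts (drop=_type_ rename=(_freq_=count));
run;

/* highest count first; descending value breaks ties upward */
proc sort data=counts;
    by id descending count descending value;
run;

data want;
    set counts;
    by id;
    if first.id;
run;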
Hi Kurt ..
Transpose is out of the question, as there are millions of rows (which would be turned into columns in a transpose) .. I could go on a nice long vacation while the program runs 🙂 .. can't transpose .. I tried it with a small subset of the data (20,000 rows) and got tired of twiddling my thumbs :))
Hello,
data want;
    set have;
    array C(*) C:;             /* every variable whose name starts with C */
    array a(100) a100-a1;      /* counters for the possible values 00-99, laid out in descending value order */
    drop i a:;
    call missing(of a(*));     /* reset the counters for each observation */
    do i=1 to dim(C);
        a(100-C(i))+1;         /* value v is tallied at position 100-v */
    end;
    /* whichn returns the FIRST position of the max; with the descending
       layout that is the highest value, so ties resolve upward.
       idx holds the most frequent value (the RESULT column) */
    idx=100-whichn(max(of a(*)), of a(*));
run;
Hi Gamotte,
Thanks so much for this awesome solution ..
Can you please explain the significance of '100' and the 'call missing'?
Thanks
@jfaruqui wrote:
Hi Gamotte,
Thanks so much for this awesome solution ..
Can you please explain the significance of '100' and the 'call missing'?
Thanks
call missing() reinitializes the counter array at the top of each observation. That's necessary because the counters are incremented with sum statements (a(...)+1), which are automatically retained across observations.
One thing to note: the 100 in @gamotte's program is the number of possible values (two digits, so 00-99), not the number of variables; the C array sizes itself with dim(), so newly added columns are picked up automatically. If you ever do need the variable count (for a sanity check, say), you can get it with a preliminary step:
data _null_;
set have;
array vars{*} C:;
call symput('nvars',put(dim(vars),best.));
stop;
run;
so you can use &nvars in later steps instead of hard-coding a count.
call missing is used to reinitialize the array for each observation.
The idea is to create an array of size 100 that stores a counter for each possible value of the C<i> variables. Since you said they have two digits, the possible values are 00 to 99. When C<i>=10, for instance, the counter at position 100-10=90 is incremented.
Since you added the condition that, in case of a tie, the maximum should be retained, and because the whichn function returns the first match, I reversed the order of the columns: a100 to a1 instead of a1 to a100.
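To make the reversal concrete, here is a toy data step (made-up counts, not from the thread) showing how whichn picks the lower value with an ascending layout and the higher value with a descending one:
data _null_;
    /* counts for the values 1-5 listed ascending: values 2 and 4 tie at 3 */
    array up(5) _temporary_ (1 3 2 3 0);
    /* the same counts listed with values descending (5,4,3,2,1) */
    array dn(5) _temporary_ (0 3 2 3 1);
    value_up = whichn(max(of up(*)), of up(*));      /* position 2 -> value 2 (lower tie value) */
    value_dn = 6 - whichn(max(of dn(*)), of dn(*));  /* position 2 -> value 4 (higher tie value) */
    put value_up= value_dn=;
run;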
A transpose of 20,000 recs is lightning fast. If you actually found the time to twiddle your thumbs, you should replace your 20-year-old computer with something more modern:
%macro make_have(no_vars,no_recs);
data have;
do id = 1 to &no_recs;
%do i = 1 %to &no_vars;
v_&i = int(rand('uniform')*55);
%end;
output;
end;
run;
%mend;
%make_have(200,20000)
proc transpose data=have out=have1 (rename=(col1=v));
by id;
var v_:;
run;
Log:
24   %macro make_have(no_vars,no_recs);
25   data have;
26   do id = 1 to &no_recs;
27   %do i = 1 %to &no_vars;
28   v_&i = int(rand('uniform')*55);
29   %end;
30   output;
31   end;
32   run;
33   %mend;
34   %make_have(200,20000)

NOTE: The data set WORK.HAVE has 20000 observations and 201 variables.
NOTE: DATA statement used (Total process time):
      real time           1.15 seconds
      cpu time            0.18 seconds

35
36   proc transpose data=have out=have1 (rename=(col1=v));
37   by id;
38   var v_:;
39   run;

NOTE: There were 20000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.HAVE1 has 4000000 observations and 3 variables.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
      real time           2.10 seconds
      cpu time            0.34 seconds
Note that's from a 2-core server, something considered very low in computing power for SAS.
Or you made an error in writing your transpose.
My suggested solution looks like this:
%macro make_have(no_vars,no_recs);
data have;
do id = 1 to &no_recs;
%do i = 1 %to &no_vars;
c&i = int(rand('uniform')*55);
%end;
output;
end;
run;
%mend;
%make_have(200,1000000)
/* wide to long: one observation per (id, value) pair */
proc transpose data=have out=have1 (drop=_name_ rename=(col1=c));
by id;
var c:;
run;

/* order the values within each id */
proc sort data=have1;
by id c;
run;

/* count the occurrences of each value per id */
data count;
set have1;
by id c;
if first.c
then count = 1;
else count + 1;
if last.c then output;
run;

/* highest count first; descending c breaks ties in favor of the higher value */
proc sort data=count;
by id descending count descending c;
run;

/* keep the winning value per id */
data want;
set count;
by id;
if first.id;
run;
The whole test for 200 variables and 1 million records produced this log:
24   %macro make_have(no_vars,no_recs);
25   data have;
26   do id = 1 to &no_recs;
27   %do i = 1 %to &no_vars;
28   c&i = int(rand('uniform')*55);
29   %end;
30   output;
31   end;
32   run;
33   %mend;
34   %make_have(200,1000000)

NOTE: The data set WORK.HAVE has 1000000 observations and 201 variables.
NOTE: DATA statement used (Total process time):
      real time           19.98 seconds
      cpu time            10.29 seconds

35
36   proc transpose data=have out=have1 (drop=_name_ rename=(col1=c));
37   by id;
38   var c:;
39   run;

NOTE: There were 1000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.HAVE1 has 200000000 observations and 2 variables.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
      real time           31.36 seconds
      cpu time            14.90 seconds

40
41   proc sort data=have1;
42   by id c;
43   run;

NOTE: There were 200000000 observations read from the data set WORK.HAVE1.
NOTE: The data set WORK.HAVE1 has 200000000 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           1:28.38
      cpu time            51.98 seconds

44
45   data count;
46   set have1;
47   by id c;
48   if first.c
49   then count = 1;
50   else count + 1;
51   if last.c then output;
52   run;

NOTE: There were 200000000 observations read from the data set WORK.HAVE1.
NOTE: The data set WORK.COUNT has 53600561 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           1:09.82
      cpu time            26.76 seconds

53
54   proc sort data=count;
55   by id descending count descending c;
56   run;

NOTE: There were 53600561 observations read from the data set WORK.COUNT.
NOTE: The data set WORK.COUNT has 53600561 observations and 3 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           35.41 seconds
      cpu time            16.39 seconds

57
58   data want;
59   set count;
60   by id;
61   if first.id;
62   run;

NOTE: There were 53600561 observations read from the data set WORK.COUNT.
NOTE: The data set WORK.WANT has 1000000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           14.40 seconds
      cpu time            4.04 seconds
At 31 seconds, you can see that the transpose will not pose a problem.
Alternatively, one can use a proc summary after the transpose:
24   proc summary data=have1 n nway;
25   by id;
26   class c;
27   var c;
28   output out=count (drop=_type_ c rename=(_freq_=count n_c=c)) n=n_c;
29   run;

NOTE: There were 200000000 observations read from the data set WORK.HAVE1.
NOTE: The data set WORK.COUNT has 53600561 observations and 3 variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           1:27.70
      cpu time            48.06 seconds
So you get a total runtime of roughly 4 minutes for 1 million records with 200 variables.
Note that the program (especially the data step/sort version) makes no assumptions about
- number of variables
- number of possible values
- variable type
and is therefore completely data-driven.
@jfaruqui wrote:
So I have a huge dataset (millions of rows) with over 100 variables (and more get added regularly) ..
What would be the most efficient code for this, allowing for the possibility of more test columns (with values) being added to the dataset?
Almost any process that continually adds variables is poorly designed. Having to change code because of constantly added variables that are essentially the same as the existing ones means that the advantage of computers doing the routine and tedious work is partially negated.
Typically this approach comes from a spreadsheet-oriented solution that is then "automated with manual changes" to account for the poor data structure. There is a large body of work on normalizing data for automated processing that may be of interest.
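For illustration only (the id, test, and value names here are made up), a normalized long layout stores one row per measurement, so adding another test adds rows instead of columns and the counting code never changes:
/* hypothetical long layout: one row per (id, test) measurement */
data have_long;
    input id test $ value;
    cards;
1 C1 10
1 C2 12
1 C3 15
1 C4 55
1 C5 15
;
run;

/* frequencies per id, independent of how many tests exist */
proc freq data=have_long noprint;
    by id;
    tables value / out=counts (drop=percent);
run;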
How about this one?
data have;
input C1 C2 C3 C4 C5;
cards;
10 12 15 55 15
10 10 15 10 12
10 10 10 10 10
10 10 12 12 15
;
run;
data want;
    if _n_=1 then do;
        /* hash object keeps one entry per distinct value, ordered ascending */
        declare hash h(ordered:'a');
        declare hiter hi('h');
        h.definekey('k');
        h.definedata('k','count');
        h.definedone();
    end;
    set have;
    array x{*} c1-c5;     /* extend this list when more columns are added */
    do i=1 to dim(x);
        k=x{i};
        rc=h.find();
        if rc=0 then do; count=count+1; h.replace(); end;  /* value seen before: bump its count */
        else do; count=1; h.add(); end;                    /* first occurrence of this value */
    end;
    /* iterate ascending; >= keeps the later (higher) value when counts tie */
    do while(hi.next()=0);
        if count>=max then do; max=count; RESULT=k; end;
    end;
    h.clear();            /* empty the hash before the next row */
    drop i k rc max count;
run;
P.S. If you can use SAS/IML, that could speed this up a lot and save you a lot of code.
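For what it's worth, a rough IML sketch of that idea might look like this (an untested outline; it assumes the C columns are the only numeric variables in HAVE):
proc iml;
use have;
read all var _NUM_ into X;               /* assumption: only the C columns are numeric */
close have;

result = j(nrow(X), 1, .);
do i = 1 to nrow(X);
    u = unique(X[i, ]);                  /* distinct values in this row, ascending */
    counts = j(1, ncol(u), 0);
    do j = 1 to ncol(u);
        counts[j] = sum(X[i, ] = u[j]);  /* occurrences of each distinct value */
    end;
    ties = loc(counts = max(counts));
    result[i] = max(u[ties]);            /* higher value wins ties */
end;

create want from result[colname='RESULT'];
append from result;
close want;
quit;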