🔒 This topic is solved and locked.
jfaruqui
Obsidian | Level 7

So I have a huge dataset (millions of rows) with over 100 variables (and more get added regularly). Each variable holds a two-digit value. For example:

 

C1 C2 C3 C4 C5 .. ..

10 12 15 55 15 .. ..

10 10 15 10 12 .. ..

10 10 10 10 10 .. ..

10 10 12 12 15 .. ..

.. .. .. .. .. .. .. 

.. .. .. .. .. .. ..

 

Now I want to create and populate a result variable that reads each row and outputs the value with the most occurrences. For example:

 

C1 C2 C3 C4 C5 .. .. RESULT

10 12 15 55 15 .. ..   15

10 10 15 10 12 .. ..   10

10 10 10 10 10 .. ..   10

10 10 12 12 15 .. ..   12 (higher value in case of a tie)

.. .. .. .. .. .. .. 

.. .. .. .. .. .. ..

 

What would be the most efficient code for this, allowing for the possibility of more test columns (with values) being added to the dataset?
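In language-neutral terms, the per-row rule can be sketched in Python (an illustration of the requested logic only; the function name is made up):

```python
from collections import Counter

def row_result(values):
    """Most frequent value in a row; on a tie, the higher value wins."""
    counts = Counter(values)
    # Compare (frequency, value) pairs, so equal frequencies
    # fall through to the larger value.
    return max(counts, key=lambda v: (counts[v], v))
```

For the sample rows above this yields 15, 10, 10, and 12 respectively.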

 

Thanks in advance !!!


9 REPLIES
Kurt_Bremser
Super User

First step: transpose from a wide to a long format:

 

10 12 15 55 15

to

10
12
15
55
15

sort (within your groups that identify original lines)

10
12
15
15
55

Now you can get a count with (eg) proc summary, and retrieve the highest count for each group.
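The same long-format pipeline, as a rough Python sketch of the logic (illustrative only, not the SAS code; the function name is made up):

```python
import itertools

def mode_via_long(row):
    # "Long" form within one row's group: one value per record, sorted
    long = sorted(row)
    # Count each run of equal values; max() over (count, value) pairs
    # picks the highest count and, on a tie, the higher value.
    counts = [(len(list(g)), v) for v, g in itertools.groupby(long)]
    return max(counts)[1]
```

The sort/count/pick-first steps below map onto PROC SORT, a counting DATA step, and a `first.id` subset.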

jfaruqui
Obsidian | Level 7

Hi Kurt ..

 

A transpose is out of the question, as there are millions of rows (which would be turned into columns in a transpose) .. I could go on a nice long vacation while the program runs 🙂 .. I tried it with a small subset of the data (20,000 rows) and got tired of twiddling my thumbs :))

gamotte
Rhodochrosite | Level 12

Hello,

 

data want;
	set have;
	array C(*) C:;           /* all variables whose names start with C */
	array a(100) a100-a1;    /* 100 counters, deliberately in reverse order */

	drop i a:;

	call missing(of a(*));   /* reset the counters for each observation */
	do i=1 to dim(C);
		a(100-C(i))+1;       /* increment the counter for this row's value */
	end;

	/* whichn returns the first maximal counter; with the reversed order,
	   that corresponds to the highest value in case of a tie */
	idx=100-whichn(max(of a(*)), of a(*));
run;
jfaruqui
Obsidian | Level 7

Hi Gamotte 

 

Thanks so much for this awesome solution ..

can you please explain the significance of '100' and the 'call missing' ??

 

thanx

Kurt_Bremser
Super User

@jfaruqui wrote:

Hi Gamotte 

 

Thanks so much for this awesome solution ..

can you please explain the significance of '100' and the 'call missing' ??

 

thanx


call missing() initializes the new array, which is necessary because array variables are automatically retained across observations.

 

@gamotte's program needs to know the number of variables beforehand, so that's the reason for using 100. But you can get that value by running a preliminary step:

data _null_;
set have;
array vars{*} C:;
call symput('nvars',put(dim(vars),best.));
stop;
run;

so you can use &nvars in the next step in place of a numeric constant.

gamotte
Rhodochrosite | Level 12

call missing is used to reinitialize the array for each observation.

 

The idea is to create an array of size 100 that stores a counter for each possible value of the C<i> variables. Since you said they are two-digit, the possible values are 00 to 99. When C<i>=10, for instance, the counter a10 is increased.

Since you added the condition that, in case of a tie, the maximum should be retained, and because the whichn function returns the first match, I reversed the order of the columns: a100 to a1 instead of a1 to a100.
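The counter-array idea, sketched in Python (an illustration of the algorithm, not the SAS code; the function name is made up — the SAS version achieves the downward scan via the reversed a100-a1 array order):

```python
def bucket_mode(values):
    counts = [0] * 100            # one counter per possible two-digit value 00-99
    for v in values:
        counts[v] += 1
    # Scan candidate values from 99 down, so the first maximal counter
    # found belongs to the highest value -- the tie-break the OP asked for.
    return max(range(99, -1, -1), key=lambda v: counts[v])
```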

Kurt_Bremser
Super User

A transpose of 20,000 records is lightning fast. If you actually found the time to twiddle your thumbs, you should replace your 20-year-old computer with something more modern:

%macro make_have(no_vars,no_recs);
data have;
do id = 1 to &no_recs;
%do i = 1 %to &no_vars;
v_&i = int(rand('uniform')*55);
%end;
output;
end;
run;
%mend;
%make_have(200,20000)

proc transpose data=have out=have1 (rename=(col1=v));
by id;
var v_:;
run;

Log:

24         %macro make_have(no_vars,no_recs);
25         data have;
26         do id = 1 to &no_recs;
27         %do i = 1 %to &no_vars;
28         v_&i = int(rand('uniform')*55);
29         %end;
30         output;
31         end;
32         run;
33         %mend;
34         %make_have(200,20000)

NOTE: The data set WORK.HAVE has 20000 observations and 201 variables.
NOTE: DATA statement used (Total process time):
      real time           1.15 seconds
      cpu time            0.18 seconds
      

35         
36         proc transpose data=have out=have1 (rename=(col1=v));
37         by id;
38         var v_:;
39         run;

NOTE: There were 20000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.HAVE1 has 4000000 observations and 3 variables.
NOTE: PROZEDUR TRANSPOSE used (Total process time):
      real time           2.10 seconds
      cpu time            0.34 seconds

Note that's from a 2-core server, something considered very low in computing power for SAS.

 

Or you made an error in writing your transpose.

 

My suggested solution looks like this:

%macro make_have(no_vars,no_recs);
data have;
do id = 1 to &no_recs;
%do i = 1 %to &no_vars;
c&i = int(rand('uniform')*55);
%end;
output;
end;
run;
%mend;
%make_have(200,1000000)

proc transpose data=have out=have1 (drop=_name_ rename=(col1=c));
by id;
var c:;
run;

proc sort data=have1;
by id c;
run;

data count;
set have1;
by id c;
if first.c
then count = 1;
else count + 1;
if last.c then output;
run;

proc sort data=count;
by id descending count descending c;
run;

data want;
set count;
by id;
if first.id;
run;

The whole test for 200 variables and 1 million records took this:

24         %macro make_have(no_vars,no_recs);
25         data have;
26         do id = 1 to &no_recs;
27         %do i = 1 %to &no_vars;
28         c&i = int(rand('uniform')*55);
29         %end;
30         output;
31         end;
32         run;
33         %mend;
34         %make_have(200,1000000)

NOTE: The data set WORK.HAVE has 1000000 observations and 201 variables.
NOTE: DATA statement used (Total process time):
      real time           19.98 seconds
      cpu time            10.29 seconds
      

35         
36         proc transpose data=have out=have1 (drop=_name_ rename=(col1=c));
37         by id;
38         var c:;
39         run;

NOTE: There were 1000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.HAVE1 has 200000000 observations and 2 variables.
NOTE: PROZEDUR TRANSPOSE used (Total process time):
      real time           31.36 seconds
      cpu time            14.90 seconds
      

40         
41         proc sort data=have1;
42         by id c;
43         run;

NOTE: There were 200000000 observations read from the data set WORK.HAVE1.
NOTE: The data set WORK.HAVE1 has 200000000 observations and 2 variables.
NOTE: PROZEDUR SORT used (Total process time):
      real time           1:28.38
      cpu time            51.98 seconds
      

44         
45         data count;
46         set have1;
47         by id c;
48         if first.c
49         then count = 1;
50         else count + 1;
51         if last.c then output;
52         run;

NOTE: There were 200000000 observations read from the data set WORK.HAVE1.
NOTE: The data set WORK.COUNT has 53600561 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           1:09.82
      cpu time            26.76 seconds
      

53         
54         proc sort data=count;
55         by id descending count descending c;
56         run;

NOTE: There were 53600561 observations read from the data set WORK.COUNT.
NOTE: The data set WORK.COUNT has 53600561 observations and 3 variables.
NOTE: PROZEDUR SORT used (Total process time):
      real time           35.41 seconds
      cpu time            16.39 seconds
      

57         
58         data want;
59         set count;
60         by id;
61         if first.id;
62         run;

NOTE: There were 53600561 observations read from the data set WORK.COUNT.
NOTE: The data set WORK.WANT has 1000000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           14.40 seconds
      cpu time            4.04 seconds
 

At 31 seconds, you can see that the transpose will not pose a problem.

Alternatively, one can use a proc summary after the transpose:

24         proc summary data=have1 n nway;
25         by id;
26         class c;
27         var c;
28         output out=count (drop=_type_ c rename=(_freq_=count n_c=c)) n=n_c;
29         run;

NOTE: There were 200000000 observations read from the data set WORK.HAVE1.
NOTE: The data set WORK.COUNT has 53600561 observations and 3 variables.
NOTE: PROZEDUR SUMMARY used (Total process time):
      real time           1:27.70
      cpu time            48.06 seconds

so you get a runtime of ~4 minutes for 1 million records with 200 variables.

 

Note that the program (especially the data step/sort version) makes no assumptions about
- the number of variables
- the number of possible values
- the variable type

and is therefore completely data-driven.

ballardw
Super User

@jfaruqui wrote:

so I have a huge dataset(millions of rows) with over 100 variables (and more get added regularly)  ..

 

What would be the most efficient code to write for this allowing for the possibility of more test columns(with values) to be added to the dataset.

 

 


Almost any process that continually adds variables is poorly designed. Having to change code because of constantly added variables that are essentially the same as current ones means that the advantage of computers doing the routine and tedious work is partially negated.

 

Typically this approach seems to come from a spreadsheet oriented solution that is then "automated with manual changes" to account for poor data structures. There is a large body of work related to normalizing data for automated processing that may be of interest.

Ksharp
Super User

How about this one ?

 

data have;
input C1 C2 C3 C4 C5;
cards;
10 12 15 55 15
10 10 15 10 12
10 10 10 10 10
10 10 12 12 15 
;
run;
data want;
 if _n_=1 then do;
   declare hash h(ordered:'a');
   declare hiter hi('h');
   h.definekey('k');
   h.definedata('k','count');
   h.definedone();
 end;
set have;
array x{*} c1-c5;
do i=1 to dim(x);
 k=x{i};
 rc=h.find();
 if rc=0 then do;count=count+1;h.replace();end;
  else do;count=1;h.add();end;
end;
do while(hi.next()=0);
  if count>=max then do;max=count;RESULT=k; end;
end;
h.clear();
drop i k rc max count;
run;

P.S. If you can use SAS/IML, that could be much faster and save you a lot of code.


Discussion stats
  • 9 replies
  • 1002 views
  • 4 likes
  • 5 in conversation