10-11-2017 04:18 AM

So I have a huge dataset (millions of rows) with over 100 variables (and more get added regularly), each variable holding a two-digit value. For example:

C1 C2 C3 C4 C5 .. ..

10 12 15 55 15 .. ..

10 10 15 10 12 .. ..

10 10 10 10 10 .. ..

10 10 12 12 15 .. ..

.. .. .. .. .. .. ..

.. .. .. .. .. .. ..

Now I want to create and populate a RESULT variable that reads each row and outputs the value with the most occurrences. For example:

C1 C2 C3 C4 C5 .. .. RESULT

10 12 15 55 15 .. .. 15

10 10 15 10 12 .. .. 10

10 10 10 10 10 .. .. 10

10 10 12 12 15 .. .. 12 (higher value in case of a tie)

.. .. .. .. .. .. ..

.. .. .. .. .. .. ..

What would be the most efficient code for this, allowing for the possibility of more test columns (with values) being added to the dataset?

Thanks in advance !!!

Accepted Solutions

Solution

10-11-2017 05:20 AM


Posted in reply to jfaruqui

10-11-2017 04:58 AM - edited 10-11-2017 05:08 AM

Hello,

```
data want;
    set have;
    /* all variables whose names start with C */
    array C(*) C:;
    /* counters a100-a1: element 1 is a100, element 100 is a1, so higher values come first */
    array a(100) a100-a1;
    drop i a:;
    /* the counters are retained across observations, so reset them for each row */
    call missing(of a(*));
    do i=1 to dim(C);
        /* value v is tallied at element 100-v */
        a(100-C(i))+1;
    end;
    /* whichn returns the FIRST position of the max, so ties resolve to the higher value */
    idx=100-whichn(max(of a(*)), of a(*));
run;
```

All Replies


Posted in reply to jfaruqui

10-11-2017 04:24 AM

First step: transpose from a wide to a long format, so that each value gets its own observation:

10 12 15 55 15

becomes

10
12
15
55
15

Then sort (within the groups that identify the original lines):

10
12
15
15
55

Now you can get a count with (e.g.) proc summary, and retrieve the highest count for each group.
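For illustration only (a Python sketch, not part of the SAS solution): the per-group logic described above, count each value, keep the highest count, and break ties toward the larger value, looks like this:

```python
from collections import Counter

def row_mode(values):
    """Most frequent value in a row; ties go to the higher value."""
    counts = Counter(values)
    # max() compares (count, value) pairs, so equal counts fall
    # through to the larger value, matching the tie-break rule
    return max(counts.items(), key=lambda kv: (kv[1], kv[0]))[0]

rows = [
    [10, 12, 15, 55, 15],
    [10, 10, 15, 10, 12],
    [10, 10, 10, 10, 10],
    [10, 10, 12, 12, 15],
]
print([row_mode(r) for r in rows])  # [15, 10, 10, 12]
```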

---------------------------------------------------------------------------------------------

Maxims of Maximally Efficient SAS Programmers


Posted in reply to KurtBremser

10-11-2017 04:31 AM

Hi Kurt ..

transpose is out of the question, as there are millions of rows (which would be turned into columns in a transpose). I could go on a nice long vacation while the program runs. I tried it with a small subset of the data (20,000 rows) and got tired of twiddling my thumbs.



Posted in reply to gamotte

10-11-2017 05:23 AM

Hi Gamotte

Thanks so much for this awesome solution ..

can you please explain the significance of '100' and the 'call missing' ??

thanx


Posted in reply to jfaruqui

10-11-2017 05:35 AM

jfaruqui wrote:

Hi Gamotte

Thanks so much for this awesome solution ..

can you please explain the significance of '100' and the 'call missing' ??

thanx

call missing() initializes the new array, which is necessary, because arrays are automatically retained.

@gamotte's program needs to know the number of variables beforehand, so that's the reason for using 100. But you can get that value by running a preliminary step:

```
data _null_;
    set have;
    array vars{*} C:;
    /* symputx strips the leading blanks that symput with put(...,best.) would leave */
    call symputx('nvars',dim(vars));
    stop;
run;
```

so you can use &nvars in the next step in place of a numeric constant.

---------------------------------------------------------------------------------------------

Maxims of Maximally Efficient SAS Programmers


Posted in reply to jfaruqui

10-11-2017 05:40 AM

call missing is used to reinitialize the array for each observation.

The idea is to create an array of size 100 which stores counters for each possible value of the C<i> variables. Since you said that they are two-digit, the possible values are 00 to 99. When C<i>=10, for instance, the counter at position 100-10=90 of the array is increased.

Since you added the condition that in case of a tie the maximum should be retained, and because the whichn function returns the first match, I reversed the order of the columns: a100 to a1 instead of a1 to a100.
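The same indexing trick can be mirrored in Python (an illustrative sketch, not the SAS code): tally value v at position 99-v so higher values sit earlier in the list, then take the first position of the maximum, which is exactly what whichn does in the SAS step.

```python
def row_mode_counters(values):
    # counters[0] tallies value 99, counters[99] tallies value 0,
    # mirroring the SAS array a100-a1 (higher values come first)
    counters = [0] * 100
    for v in values:
        counters[99 - v] += 1
    # list.index returns the FIRST position of the max, just like
    # whichn(), so on a tie the higher value wins
    return 99 - counters.index(max(counters))

print(row_mode_counters([10, 10, 12, 12, 15]))  # 12 (10 and 12 tie; higher wins)
```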


Posted in reply to jfaruqui

10-11-2017 07:27 AM

A transpose of 20,000 records is lightning fast. If you actually found the time to twiddle your thumbs, you should replace your 20-year-old computer with something more modern:

```
%macro make_have(no_vars,no_recs);
data have;
    do id = 1 to &no_recs;
    %do i = 1 %to &no_vars;
        v_&i = int(rand('uniform')*55);
    %end;
        output;
    end;
run;
%mend;

%make_have(200,20000)

proc transpose data=have out=have1 (rename=(col1=v));
    by id;
    var v_:;
run;
```

Log:

```
24   %macro make_have(no_vars,no_recs);
25   data have;
26   do id = 1 to &no_recs;
27   %do i = 1 %to &no_vars;
28   v_&i = int(rand('uniform')*55);
29   %end;
30   output;
31   end;
32   run;
33   %mend;
34   %make_have(200,20000)

NOTE: The data set WORK.HAVE has 20000 observations and 201 variables.
NOTE: DATA statement used (Total process time):
      real time           1.15 seconds
      cpu time            0.18 seconds

35
36   proc transpose data=have out=have1 (rename=(col1=v));
37   by id;
38   var v_:;
39   run;

NOTE: There were 20000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.HAVE1 has 4000000 observations and 3 variables.
NOTE: PROZEDUR TRANSPOSE used (Total process time):
      real time           2.10 seconds
      cpu time            0.34 seconds
```

Note that's from a 2-core server, something considered very low in computing power for SAS.

Or you made an error in writing your transpose.

My suggested solution looks like this:

```
%macro make_have(no_vars,no_recs);
data have;
    do id = 1 to &no_recs;
    %do i = 1 %to &no_vars;
        c&i = int(rand('uniform')*55);
    %end;
        output;
    end;
run;
%mend;

%make_have(200,1000000)

proc transpose data=have out=have1 (drop=_name_ rename=(col1=c));
    by id;
    var c:;
run;

proc sort data=have1;
    by id c;
run;

data count;
    set have1;
    by id c;
    if first.c
    then count = 1;
    else count + 1;
    if last.c then output;
run;

proc sort data=count;
    by id descending count descending c;
run;

data want;
    set count;
    by id;
    if first.id;
run;
```

The whole test for 200 variables and 1 million records took this:

```
24   %macro make_have(no_vars,no_recs);
25   data have;
26   do id = 1 to &no_recs;
27   %do i = 1 %to &no_vars;
28   c&i = int(rand('uniform')*55);
29   %end;
30   output;
31   end;
32   run;
33   %mend;
34   %make_have(200,1000000)

NOTE: The data set WORK.HAVE has 1000000 observations and 201 variables.
NOTE: DATA statement used (Total process time):
      real time           19.98 seconds
      cpu time            10.29 seconds

35
36   proc transpose data=have out=have1 (drop=_name_ rename=(col1=c));
37   by id;
38   var c:;
39   run;

NOTE: There were 1000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.HAVE1 has 200000000 observations and 2 variables.
NOTE: PROZEDUR TRANSPOSE used (Total process time):
      real time           31.36 seconds
      cpu time            14.90 seconds

40
41   proc sort data=have1;
42   by id c;
43   run;

NOTE: There were 200000000 observations read from the data set WORK.HAVE1.
NOTE: The data set WORK.HAVE1 has 200000000 observations and 2 variables.
NOTE: PROZEDUR SORT used (Total process time):
      real time           1:28.38
      cpu time            51.98 seconds

44
45   data count;
46   set have1;
47   by id c;
48   if first.c
49   then count = 1;
50   else count + 1;
51   if last.c then output;
52   run;

NOTE: There were 200000000 observations read from the data set WORK.HAVE1.
NOTE: The data set WORK.COUNT has 53600561 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           1:09.82
      cpu time            26.76 seconds

53
54   proc sort data=count;
55   by id descending count descending c;
56   run;

NOTE: There were 53600561 observations read from the data set WORK.COUNT.
NOTE: The data set WORK.COUNT has 53600561 observations and 3 variables.
NOTE: PROZEDUR SORT used (Total process time):
      real time           35.41 seconds
      cpu time            16.39 seconds

57
58   data want;
59   set count;
60   by id;
61   if first.id;
62   run;

NOTE: There were 53600561 observations read from the data set WORK.COUNT.
NOTE: The data set WORK.WANT has 1000000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           14.40 seconds
      cpu time            4.04 seconds
```

At 31 seconds, you can see that the transpose will not pose a problem.

Alternatively, one can use a proc summary after the transpose:

```
24   proc summary data=have1 n nway;
25   by id;
26   class c;
27   var c;
28   output out=count (drop=_type_ c rename=(_freq_=count n_c=c)) n=n_c;
29   run;

NOTE: There were 200000000 observations read from the data set WORK.HAVE1.
NOTE: The data set WORK.COUNT has 53600561 observations and 3 variables.
NOTE: PROZEDUR SUMMARY used (Total process time):
      real time           1:27.70
      cpu time            48.06 seconds
```

So you get a runtime of ~4 minutes for 1 million records with 200 variables.

Note that the program (especially the data step/sort version) makes no assumptions about

- the number of variables
- the number of possible values
- the variable type

and is therefore completely data-driven.

---------------------------------------------------------------------------------------------

Maxims of Maximally Efficient SAS Programmers


Posted in reply to jfaruqui

10-11-2017 10:29 AM

jfaruqui wrote:

so I have a huge dataset(millions of rows) with over 100 variables (and more get added regularly) ..

What would be the most efficient code to write for this allowing for the possibility of more test columns(with values) to be added to the dataset.

Almost any process that continually adds variables is poorly designed. Having to change code because of constantly added variables that are essentially the same as current ones means that the advantage of computers doing the routine and tedious work is partially negated.

Typically this approach seems to come from a spreadsheet oriented solution that is then "automated with manual changes" to account for poor data structures. There is a large body of work related to normalizing data for automated processing that may be of interest.


Posted in reply to jfaruqui

10-11-2017 10:53 AM

How about this one?

```
data have;
    input C1 C2 C3 C4 C5;
cards;
10 12 15 55 15
10 10 15 10 12
10 10 10 10 10
10 10 12 12 15
;
run;

data want;
    if _n_=1 then do;
        /* ordered:'a' makes the iterator visit keys in ascending order */
        declare hash h(ordered:'a');
        declare hiter hi('h');
        h.definekey('k');
        h.definedata('k','count');
        h.definedone();
    end;
    set have;
    array x{*} c1-c5;
    /* count the occurrences of each value in this row */
    do i=1 to dim(x);
        k=x{i};
        rc=h.find();
        if rc=0 then do;
            count=count+1;
            h.replace();
        end;
        else do;
            count=1;
            h.add();
        end;
    end;
    /* scan in ascending key order; >= lets the higher key win ties */
    do while(hi.next()=0);
        if count>=max then do;
            max=count;
            RESULT=k;
        end;
    end;
    h.clear();
    drop i k rc max count;
run;
```

P.S. If you can use SAS/IML, that could be much faster and save you a lot of code.
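For comparison only, the hash-object logic above maps naturally onto a plain Python dict (an illustrative sketch, not the SAS code): count per row, then scan the keys in ascending order keeping any count that is at least the best so far, so the higher key wins ties.

```python
def row_mode_hash(values):
    counts = {}
    for v in values:                 # mirrors h.find()/h.replace()/h.add()
        counts[v] = counts.get(v, 0) + 1
    best, result = 0, None
    for k in sorted(counts):         # mirrors the ordered:'a' hash iterator
        if counts[k] >= best:        # >= lets the later (higher) key win ties
            best, result = counts[k], k
    return result

print(row_mode_hash([10, 10, 12, 12, 15]))  # 12
```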