Solved: Re: Count of times repeated alternative to proc freq?

Cruise · Posted 06-28-2018 10:32 AM

Is there a way to create frequency of repetition of duplicates by certain variable? I know proc freq and merge it back to an original data. But I'm wondering is there any more efficient way doing the same thing? I couldn't figure it out yet using first.obs and last.obs features.

data b.file;
input pat_id drug $ count;
cards;
3 648 2
3 009 2
4 504 2
4 569 2
5 003 1
;

novinosrin · Posted 06-28-2018 10:37 AM

Do you mean this?

data file;
input pat_id drug $ ;
cards;
3 648 2
3 009 2
4 504 2
4 569 2
5 003 1
;

proc sql;
create table want as
select pat_id,drug, count(pat_id) as count
from file
group by pat_id;
quit;

View solution in original post

ballardw · Posted 06-28-2018 10:36 AM

What do you expect to show as a result from that data? It is not at all clear what you want.

novinosrin · Posted 06-28-2018 10:37 AM

Do you mean this?

data file;
input pat_id drug $ ;
cards;
3 648 2
3 009 2
4 504 2
4 569 2
5 003 1
;

proc sql;
create table want as
select pat_id,drug, count(pat_id) as count
from file
group by pat_id;
quit;

Cruise · Posted 06-28-2018 10:44 AM

@novinosrin

Awesome. And I can keep all the variables since I want to keep the structure of original data, correct. Thanks a lot.

proc sql;

create table b.want as

select *, count(pat_id) as count

from b.file

group by pat_id;

quit;

Kurt_Bremser · Posted 06-28-2018 10:44 AM

With all methods, you have to first create the sums and then merge back. SQL does it transparently, but the overall work by SAS is mostly the same, as is the time that it will take.

If the file is sorted, you can use a double DOW:

data have;
input pat_id drug $ count;
cards;
3 648 2
3 009 2
4 504 2
4 569 2
5 003 1
;
run;

data want;
newcount = 0;
do until (last.pat_id);
  set have;
  by pat_id;
  newcount + 1;
end;
do until (last.pat_id);
  set have;
  by pat_id;
  output;
end;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Cruise · Posted 06-28-2018 10:47 AM

I just run into this problem not being able to Like the posts. Do you know why?

FreelanceReinh · Posted 06-28-2018 11:32 AM

@Cruise wrote:

I just run into this problem not being able to Like the posts. Do you know why?

This seems to be a general issue. At the moment, I can't give likes either. Not sure if the moderators (@ChrisHemedinger and colleagues) are aware of it.

ChrisHemedinger · Posted 06-28-2018 11:39 AM

Sorry about that -- we had an interloper that was coming in and spamming us with Likes. I know, seems innocuous, but not a cool behavior. We clamped down a bit, too much it seems. But you should be able to Like all you want now.

Become an Explorer! Join SAS Analytics Explorers to learn and complete challenges that earn rewards!

FreelanceReinh · Posted 06-28-2018 11:03 AM

The DOW loop technique suggested by Kurt Bremser has an additional advantage over PROC SQL (without a suitable ORDER BY clause): It ensures that observation order is maintained.

Cruise · Posted 06-28-2018 11:10 AM

Can you give an example of "Not Suitable ORDER BY clause" situation?

FreelanceReinh · Posted 06-28-2018 11:22 AM

@Cruise wrote:
Can you give an example of "Not Suitable ORDER BY clause" situation?

Well, when I run @novinosrin's PROC SQL step, I get this result:

pat_id    drug    count

   3      009       2
   3      648       2
   4      504       2
   4      569       2
   5      003       1

As you see, the order of the first two observations has changed (as compared to the input dataset). Since dataset FILE is not sorted by DRUG within PAT_ID BY groups (neither ascending nor descending), there is no obvious ORDER BY clause which would be suitable to maintain the original sort order (if this was an issue).

Cruise · Posted 06-28-2018 11:24 AM

I see. Thanks.

novinosrin · Posted 06-28-2018 11:27 AM

Try this for fun and test the performance against DOW

data file;
input pat_id drug $ ;
cards;
3 648 2
3 009 2
4 504 2
4 569 2
5 003 1
;


data want;
 dcl hash h(suminc: "c", dataset:"file");
 h.defineKey('pat_id');
 h.defineDone();
 c = 1;
 do while (not last);
   set file  end=last;
   rc = h.find();
 end;
 last = 0;
 do while (not last);
   set file end=last;
   rc = h.sum(sum: count);
   output;
 end;
 stop;
 keep pat_id drug  count;
run;

Cruise · Posted 06-28-2018 11:32 AM

Great solutions going on here while I lost LIKE button. I really want my LIKE button back. Please let me know if someone knows why LIKE is lost?

FreelanceReinh · Posted 06-28-2018 12:49 PM

@novinosrin wrote:

Try this for fun and test the performance against DOW

data file;
input pat_id drug $ ;
cards;
3 648 2
3 009 2
4 504 2
4 569 2
5 003 1
;


data want;
 dcl hash h(suminc: "c", dataset:"file");
 h.defineKey('pat_id');
 h.defineDone();
 c = 1;
 do while (not last);
   set file  end=last;
   rc = h.find();
 end;
 last = 0;
 do while (not last);
   set file end=last;
   rc = h.sum(sum: count);
   output;
 end;
 stop;
 keep pat_id drug  count;
run;

Good idea to test the performance. I've added a modified version of your hash solution to make it faster (by reading the input dataset only twice rather than three times):

data want;
dcl hash h(suminc: 'c');
h.defineKey('pat_id');
h.defineDone();
c = 1;
do until(last1);
  set have end=last1;
  h.ref();
end;
do until(last2);
  set have end=last2;
  h.sum(sum: count);
  output;
end;
stop;
drop c;
run;

Then I tested the solutions using this input dataset HAVE containing 80 million observations from 12,695,381 "patients" with 1 - 102 observations each (median: 10).

data have;
call streaminit(271828);
do _n_=1 to 80000000;
  drug=rand('normal');
  if drug<-1 then pat_id+1;
  output;
end;
run;

And the winner is (not quite surprisingly) -- @Kurt_Bremser's classic double DOW loop! (Actually, your hash solution uses a double DOW loop as well, that's why I say "classic".)

The WANT dataset (with 80 million obs.) was created in

11 s using the classic double DOW loop
19 s using PROC SQL
36 s using hash object with SUMINC and REF
50 s using hash object with SUMINC, DATASET and FIND

(with SAS 9.4 TS1M2 64 Bit on a Windows 7 Workstation).

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away