Is there a way to create frequency of repetition of duplicates by certain variable? I know proc freq and merge it back to an original data. But I'm wondering is there any more efficient way doing the same thing? I couldn't figure it out yet using first.obs and last.obs features.
data b.file;
input pat_id drug $ count;
cards;
3 648 2
3 009 2
4 504 2
4 569 2
5 003 1
;
Do you mean this?
data file;
input pat_id drug $ ;
cards;
3 648 2
3 009 2
4 504 2
4 569 2
5 003 1
;
proc sql;
create table want as
select pat_id,drug, count(pat_id) as count
from file
group by pat_id;
quit;
What do you expect to show as a result from that data? It is not at all clear what you want.
Do you mean this?
data file;
input pat_id drug $ ;
cards;
3 648 2
3 009 2
4 504 2
4 569 2
5 003 1
;
proc sql;
create table want as
select pat_id,drug, count(pat_id) as count
from file
group by pat_id;
quit;
Awesome. And I can keep all the variables since I want to keep the structure of original data, correct. Thanks a lot.
proc sql;
create table b.want as
select *, count(pat_id) as count
from b.file
group by pat_id;
quit;
With all methods, you have to first create the sums and then merge back. SQL does it transparently, but the overall work by SAS is mostly the same, as is the time that it will take.
If the file is sorted, you can use a double DOW:
data have;
input pat_id drug $ count;
cards;
3 648 2
3 009 2
4 504 2
4 569 2
5 003 1
;
run;
data want;
newcount = 0;
do until (last.pat_id);
set have;
by pat_id;
newcount + 1;
end;
do until (last.pat_id);
set have;
by pat_id;
output;
end;
run;
I just run into this problem not being able to Like the posts. Do you know why?
@Cruise wrote:
I just run into this problem not being able to Like the posts. Do you know why?
This seems to be a general issue. At the moment, I can't give likes either. Not sure if the moderators (@ChrisHemedinger and colleagues) are aware of it.
Sorry about that -- we had an interloper that was coming in and spamming us with Likes. I know, seems innocuous, but not a cool behavior. We clamped down a bit, too much it seems. But you should be able to Like all you want now.
The DOW loop technique suggested by Kurt Bremser has an additional advantage over PROC SQL (without a suitable ORDER BY clause): It ensures that observation order is maintained.
@Cruise wrote:
Can you give an example of "Not Suitable ORDER BY clause" situation?
Well, when I run @novinosrin's PROC SQL step, I get this result:
pat_id drug count 3 009 2 3 648 2 4 504 2 4 569 2 5 003 1
As you see, the order of the first two observations has changed (as compared to the input dataset). Since dataset FILE is not sorted by DRUG within PAT_ID BY groups (neither ascending nor descending), there is no obvious ORDER BY clause which would be suitable to maintain the original sort order (if this was an issue).
Try this for fun and test the performance against DOW
data file;
input pat_id drug $ ;
cards;
3 648 2
3 009 2
4 504 2
4 569 2
5 003 1
;
data want;
dcl hash h(suminc: "c", dataset:"file");
h.defineKey('pat_id');
h.defineDone();
c = 1;
do while (not last);
set file end=last;
rc = h.find();
end;
last = 0;
do while (not last);
set file end=last;
rc = h.sum(sum: count);
output;
end;
stop;
keep pat_id drug count;
run;
Great solutions going on here while I lost LIKE button. I really want my LIKE button back. Please let me know if someone knows why LIKE is lost?
@novinosrin wrote:
Try this for fun and test the performance against DOW
data file; input pat_id drug $ ; cards; 3 648 2 3 009 2 4 504 2 4 569 2 5 003 1 ; data want; dcl hash h(suminc: "c", dataset:"file"); h.defineKey('pat_id'); h.defineDone(); c = 1; do while (not last); set file end=last; rc = h.find(); end; last = 0; do while (not last); set file end=last; rc = h.sum(sum: count); output; end; stop; keep pat_id drug count; run;
Good idea to test the performance. I've added a modified version of your hash solution to make it faster (by reading the input dataset only twice rather than three times):
data want;
dcl hash h(suminc: 'c');
h.defineKey('pat_id');
h.defineDone();
c = 1;
do until(last1);
set have end=last1;
h.ref();
end;
do until(last2);
set have end=last2;
h.sum(sum: count);
output;
end;
stop;
drop c;
run;
Then I tested the solutions using this input dataset HAVE containing 80 million observations from 12,695,381 "patients" with 1 - 102 observations each (median: 10).
data have;
call streaminit(271828);
do _n_=1 to 80000000;
drug=rand('normal');
if drug<-1 then pat_id+1;
output;
end;
run;
And the winner is (not quite surprisingly) -- @Kurt_Bremser's classic double DOW loop! (Actually, your hash solution uses a double DOW loop as well, that's why I say "classic".)
The WANT dataset (with 80 million obs.) was created in
(with SAS 9.4 TS1M2 64 Bit on a Windows 7 Workstation).
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.